This dataset is based on the “Bank Marketing” UCI dataset (see the description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The data is enriched with five additional social and economic features/attributes (nationwide indicators from a country of roughly 10 million inhabitants), published by the Banco de Portugal and publicly available at: https://www.bportugal.pt/estatisticasweb. This dataset is almost identical to the one used in [Moro et al., 2014] (it does not include all attributes due to privacy concerns).
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
library(tidyverse)
library(gmodels)
library(ggmosaic)
library(sjmisc)
library(questionr)
library(corrplot)
library(gridExtra)
library(ggpubr)
library(cowplot)
library(DMwR)
library(caret)
library(e1071)
library(ROCR)
library(plotROC)
library(pROC)
library(rpart)
library(rpart.plot)
library(randomForest)
library(ranger)
library(lightgbm)
library(doMC)
registerDoMC(cores = 10)
select = dplyr::select
slice = dplyr::slice
## [1] 41188    21
The dataset has 41,188 rows and 21 columns.
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "pdays" "previous" "poutcome" "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
## [21] "y"
The first 20 variables are our potential explanatory variables and the last one (“y”) is the dependent variable.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | no | yes |
## |-----------|-----------|
## | 36548 | 4640 |
## | 0.887 | 0.113 |
## |-----------|-----------|
##
##
##
##
This is an imbalanced two-level categorical variable: 88.7% of the values are “no” (or “0”) and only 11.3% are “yes” (or “1”). It is more natural to work with a 0/1 dependent variable:
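The conversion itself isn’t shown; a minimal sketch (assuming the data frame is called `bank_data`, as in the rest of the analysis) could be:

```r
# recode the "yes"/"no" labels of y into a 0/1 factor
bank_data = bank_data %>%
  mutate(y = factor(if_else(y == "yes", 1, 0)))
```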
## age job marital education default housing loan contact month
## 1 56 housemaid married basic.4y no no no telephone may
## 2 57 services married high.school unknown no no telephone may
## 3 37 services married high.school no yes no telephone may
## 4 40 admin. married basic.6y no no no telephone may
## 5 56 services married high.school no no yes telephone may
## 6 45 services married basic.9y unknown no no telephone may
## day_of_week duration campaign pdays previous poutcome emp.var.rate
## 1 mon 261 1 999 0 nonexistent 1.1
## 2 mon 149 1 999 0 nonexistent 1.1
## 3 mon 226 1 999 0 nonexistent 1.1
## 4 mon 151 1 999 0 nonexistent 1.1
## 5 mon 307 1 999 0 nonexistent 1.1
## 6 mon 198 1 999 0 nonexistent 1.1
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1 93.994 -36.4 4.857 5191 0
## 2 93.994 -36.4 4.857 5191 0
## 3 93.994 -36.4 4.857 5191 0
## 4 93.994 -36.4 4.857 5191 0
## 5 93.994 -36.4 4.857 5191 0
## 6 93.994 -36.4 4.857 5191 0
## [1] 0
There are no missing values in the dataset. However, according to the data documentation, the value “unknown” stands for NA.
## [1] 12718
There are 12,718 unknown values in the dataset, let’s try to find out which variables suffer the most from those NA values.
bank_data %>%
summarise_all(list(~sum(. == "unknown"))) %>%
gather(key = "variable", value = "nr_unknown") %>%
  arrange(-nr_unknown)
##          variable nr_unknown
## 1 default 8597
## 2 education 1731
## 3 housing 990
## 4 loan 990
## 5 job 330
## 6 marital 80
## 7 age 0
## 8 contact 0
## 9 month 0
## 10 day_of_week 0
## 11 duration 0
## 12 campaign 0
## 13 pdays 0
## 14 previous 0
## 15 poutcome 0
## 16 emp.var.rate 0
## 17 cons.price.idx 0
## 18 cons.conf.idx 0
## 19 euribor3m 0
## 20 nr.employed 0
## 21 y 0
Six features have at least one unknown value. Before deciding how to handle those missing values, we’ll study each variable and decide after visualisation. We can’t afford to delete 8,597 rows from our dataset; that’s more than 20% of our observations.
Each feature will be checked one at a time. We’ll eventually drop or transform some variables in order to clean up our dataset a little bit.
Here are a few functions that will be useful later.
theme_set(theme_bw())
# setting default parameters for mosaic plots
mosaic_theme = theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# setting default parameters for crosstables
fun_crosstable = function(df, var1, var2){
CrossTable(df[, var1], df[, var2],
prop.r = T,
prop.c = F,
prop.t = F,
prop.chisq = F,
dnn = c(var1, var2))
}
# plot weighted lm/loess regressions with frequencies
fun_gg_freq = function(var){
weight = table(bank_data[, var]) %>%
as.data.frame %>%
mutate(x = as.numeric(as.character(Var1))) %>%
select(-Var1) %>%
rename(weight = Freq)
sink(tempfile())
freq = fun_crosstable(bank_data, var, "y")$prop.r %>%
as.data.frame %>%
mutate(x = as.numeric(as.character(x)))
sink()
both = freq %>%
left_join(weight, by = "x") %>%
filter(weight > 50 & y == 1)
gg = both %>%
ggplot() +
aes(x = x,
y = Freq,
weight = weight) +
geom_point(aes(size = weight)) +
geom_smooth(aes(colour = "blue"), method = "loess") +
geom_smooth(aes(colour = "red"), method = "lm", se = F) +
coord_cartesian(ylim = c(-0.1, 1)) +
theme(plot.margin = unit(c(0, 0, 0, 0), "pt")) +
xlab(var) +
ylab("") +
scale_x_continuous(position = "top") +
scale_colour_manual(values = c("blue", "red"),
labels = c("loess", "lm")) +
labs(colour = "Regression")
return(gg)
}
# re-ordering levels from factor variable
fun_reorder_levels = function(df, variable, first){
remaining = unique(df[, variable])[which(unique(df[, variable]) != first)]
x = factor(df[, variable], levels = c(first, remaining))
return(x)
}
# plotting importance from predictive models into two panels
fun_imp_ggplot_split = function(model){
if (class(model)[1] == "ranger"){
imp_df = model$variable.importance %>%
data.frame("Overall" = .) %>%
rownames_to_column() %>%
rename(variable = rowname) %>%
arrange(-Overall)
} else {
imp_df = varImp(model) %>%
rownames_to_column() %>%
rename(variable = rowname) %>%
arrange(-Overall)
}
gg1 = imp_df %>%
slice(1:floor(nrow(.)/2)) %>%
ggplot() +
aes(x = reorder(variable, Overall), weight = Overall, fill = -Overall) +
geom_bar() +
coord_flip() +
xlab("Variables") +
ylab("Importance") +
theme(legend.position = "none")
imp_range = ggplot_build(gg1)[["layout"]][["panel_params"]][[1]][["x.range"]]
imp_gradient = scale_fill_gradient(limits = c(-imp_range[2], -imp_range[1]),
low = "#132B43",
high = "#56B1F7")
gg2 = imp_df %>%
slice(floor(nrow(.)/2)+1:nrow(.)) %>%
ggplot() +
aes(x = reorder(variable, Overall), weight = Overall, fill = -Overall) +
geom_bar() +
coord_flip() +
xlab("") +
ylab("Importance") +
theme(legend.position = "none") +
ylim(imp_range) +
imp_gradient
gg_both = plot_grid(gg1 + imp_gradient,
gg2)
return(gg_both)
}
# plotting two performance measures
fun_gg_cutoff = function(score, obs, measure1, measure2) {
predictions = prediction(score, obs)
performance1 = performance(predictions, measure1)
performance2 = performance(predictions, measure2)
df1 = data.frame(x = performance1@x.values[[1]],
y = performance1@y.values[[1]],
measure = measure1,
stringsAsFactors = F)
df2 = data.frame(x = performance2@x.values[[1]],
y = performance2@y.values[[1]],
measure = measure2,
stringsAsFactors = F)
df = df1 %>%
bind_rows(df2)
gg = df %>%
ggplot() +
aes(x = x,
y = y,
colour = measure) +
geom_line() +
xlim(c(0, 1)) +
ylab("") +
xlab("Threshold")
return(gg)
}
# creating classes according to score and cut
fun_cut_predict = function(score, cut) {
classes = score
classes[classes > cut] = 1
classes[classes <= cut] = 0
classes = as.factor(classes)
return(classes)
}
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 17.00 32.00 38.00 40.02 47.00 98.00
Ages range from 17 to 98; nothing looks unusual there. The other summary statistics are plausible as well, with an average of about 40 years.
bank_data %>%
ggplot() +
aes(x = age) +
geom_bar() +
geom_vline(xintercept = c(30, 60),
col = "red",
linetype = "dashed") +
facet_grid(y ~ .,
scales = "free_y") +
  scale_x_continuous(breaks = seq(0, 100, 5))
Above the 60-year threshold, the relative frequency is higher when y = 1. In other words, elderly people are more likely to subscribe to a term deposit.
bank_data %>%
mutate(elder60 = if_else(age > 60, "1", "0")) %>%
group_by(y) %>%
add_count(nr_y = n()) %>%
group_by(elder60, y) %>%
summarise(abs_freq = n(),
            relative_freq = round(100*n()/first(nr_y), 2))
## # A tibble: 4 x 4
## # Groups: elder60 [2]
## elder60 y abs_freq relative_freq
## <chr> <fct> <int> <dbl>
## 1 0 0 36052 98.6
## 2 0 1 4226 91.1
## 3 1 0 496 1.36
## 4 1 1 414 8.92
Elderly people represent 8.92% of the subscribers; this proportion drops to 1.36% among non-subscribers.
We can also slice at 30 years to make three easily interpretable classes: [0, 30[, [30, 60[ and [60, +Inf[. The minimum and maximum observed values are 17 and 98, but we can expect new observations outside this range. We’re replacing the continuous variable “age” with this categorical variable.
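A sketch of that recoding, matching the “low”/“mid”/“high” level names that appear in the crosstable below:

```r
# bin age into three classes: [0, 30[ = "low", [30, 60[ = "mid", [60, +Inf[ = "high"
bank_data = bank_data %>%
  mutate(age = if_else(age < 30, "low", if_else(age < 60, "mid", "high")))
```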
Let’s cross it with our dependent variable.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | y
## age | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## high | 496 | 414 | 910 |
## | 0.545 | 0.455 | 0.022 |
## -------------|-----------|-----------|-----------|
## low | 6259 | 1124 | 7383 |
## | 0.848 | 0.152 | 0.179 |
## -------------|-----------|-----------|-----------|
## mid | 29793 | 3102 | 32895 |
## | 0.906 | 0.094 | 0.799 |
## -------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## -------------|-----------|-----------|-----------|
##
##
45.5% of people over 60 subscribed to a term deposit, which is a lot compared with younger individuals: 15.2% for young adults (under 30) and only 9.4% for the remaining observations (aged between 30 and 60).
##
## admin. blue-collar entrepreneur housemaid management
## 10422 9254 1456 1060 2924
## retired self-employed services student technician
## 1720 1421 3969 875 6743
## unemployed unknown
## 1014 330
Admin. (white-collar?) and blue-collar are the dominant job categories in the dataset. We can also notice 330 “unknown” values.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | y
## job | 0 | 1 | Row Total |
## --------------|-----------|-----------|-----------|
## admin. | 9070 | 1352 | 10422 |
## | 0.870 | 0.130 | 0.253 |
## --------------|-----------|-----------|-----------|
## blue-collar | 8616 | 638 | 9254 |
## | 0.931 | 0.069 | 0.225 |
## --------------|-----------|-----------|-----------|
## entrepreneur | 1332 | 124 | 1456 |
## | 0.915 | 0.085 | 0.035 |
## --------------|-----------|-----------|-----------|
## housemaid | 954 | 106 | 1060 |
## | 0.900 | 0.100 | 0.026 |
## --------------|-----------|-----------|-----------|
## management | 2596 | 328 | 2924 |
## | 0.888 | 0.112 | 0.071 |
## --------------|-----------|-----------|-----------|
## retired | 1286 | 434 | 1720 |
## | 0.748 | 0.252 | 0.042 |
## --------------|-----------|-----------|-----------|
## self-employed | 1272 | 149 | 1421 |
## | 0.895 | 0.105 | 0.035 |
## --------------|-----------|-----------|-----------|
## services | 3646 | 323 | 3969 |
## | 0.919 | 0.081 | 0.096 |
## --------------|-----------|-----------|-----------|
## student | 600 | 275 | 875 |
## | 0.686 | 0.314 | 0.021 |
## --------------|-----------|-----------|-----------|
## technician | 6013 | 730 | 6743 |
## | 0.892 | 0.108 | 0.164 |
## --------------|-----------|-----------|-----------|
## unemployed | 870 | 144 | 1014 |
## | 0.858 | 0.142 | 0.025 |
## --------------|-----------|-----------|-----------|
## unknown | 293 | 37 | 330 |
## | 0.888 | 0.112 | 0.008 |
## --------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## --------------|-----------|-----------|-----------|
##
##
The “unknown” level doesn’t carry any useful information and should be discarded from the data. We’ll remove the rows containing this value in the “job” column.
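That removal is a simple filter (a sketch, assuming the column name above):

```r
# drop the 330 rows where job is unknown
bank_data = bank_data %>%
  filter(job != "unknown")
```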
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, job), fill = y)) +
mosaic_theme +
xlab("Job") +
  ylab(NULL)
Surprisingly, the student (31.4%), retired (25.2%) and unemployed (14.2%) categories show the best relative frequencies of term deposit subscription. The other levels range between 6.9% (blue-collar) and 13.0% (admin.).
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40858
##
##
## | y
## marital | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## divorced | 4126 | 473 | 4599 |
## | 0.897 | 0.103 | 0.113 |
## -------------|-----------|-----------|-----------|
## married | 22178 | 2516 | 24694 |
## | 0.898 | 0.102 | 0.604 |
## -------------|-----------|-----------|-----------|
## single | 9889 | 1605 | 11494 |
## | 0.860 | 0.140 | 0.281 |
## -------------|-----------|-----------|-----------|
## unknown | 62 | 9 | 71 |
## | 0.873 | 0.127 | 0.002 |
## -------------|-----------|-----------|-----------|
## Column Total | 36255 | 4603 | 40858 |
## -------------|-----------|-----------|-----------|
##
##
For the same reason as before, we’ll remove the rows with “unknown” in this variable.
Singles subscribe slightly more often (14.0%) to term deposits than others (divorced: 10.3%, married: 10.2%).
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, marital), fill = y)) +
mosaic_theme +
xlab("Marital status") +
  ylab(NULL)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40787
##
##
## | y
## education | 0 | 1 | Row Total |
## --------------------|-----------|-----------|-----------|
## basic.4y | 3695 | 423 | 4118 |
## | 0.897 | 0.103 | 0.101 |
## --------------------|-----------|-----------|-----------|
## basic.6y | 2077 | 187 | 2264 |
## | 0.917 | 0.083 | 0.056 |
## --------------------|-----------|-----------|-----------|
## basic.9y | 5536 | 470 | 6006 |
## | 0.922 | 0.078 | 0.147 |
## --------------------|-----------|-----------|-----------|
## high.school | 8436 | 1028 | 9464 |
## | 0.891 | 0.109 | 0.232 |
## --------------------|-----------|-----------|-----------|
## illiterate | 14 | 4 | 18 |
## | 0.778 | 0.222 | 0.000 |
## --------------------|-----------|-----------|-----------|
## professional.course | 4631 | 594 | 5225 |
## | 0.886 | 0.114 | 0.128 |
## --------------------|-----------|-----------|-----------|
## university.degree | 10442 | 1654 | 12096 |
## | 0.863 | 0.137 | 0.297 |
## --------------------|-----------|-----------|-----------|
## unknown | 1362 | 234 | 1596 |
## | 0.853 | 0.147 | 0.039 |
## --------------------|-----------|-----------|-----------|
## Column Total | 36193 | 4594 | 40787 |
## --------------------|-----------|-----------|-----------|
##
##
The illiterate category has too few observations to be statistically meaningful: we can’t characterise illiterate people from a pool of only 18 individuals. Hence, those rows will be deleted from the dataset.
Among the 1,596 rows containing the “unknown” value, 234 subscribed to a term deposit. That is around 5% of all subscribers. Since our dependent variable is very imbalanced, we cannot afford to discard those rows. As this category has the highest relative frequency of “y = 1” (14.7%), we’re going to merge it into the “university.degree” level, which has the second highest “y = 1” relative frequency (13.7%).
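Both operations can be sketched as follows (assuming the level names shown in the crosstable above):

```r
# drop the 18 illiterate rows and merge "unknown" into "university.degree"
bank_data = bank_data %>%
  filter(education != "illiterate") %>%
  mutate(education = if_else(education == "unknown", "university.degree", education))
```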
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, education), fill = y)) +
mosaic_theme +
xlab("Education") +
  ylab(NULL)
There seems to be a positive correlation between the number of years of education and the odds of subscribing to a term deposit.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40769
##
##
## | y
## default | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## no | 28182 | 4155 | 32337 |
## | 0.872 | 0.128 | 0.793 |
## -------------|-----------|-----------|-----------|
## unknown | 7994 | 435 | 8429 |
## | 0.948 | 0.052 | 0.207 |
## -------------|-----------|-----------|-----------|
## yes | 3 | 0 | 3 |
## | 1.000 | 0.000 | 0.000 |
## -------------|-----------|-----------|-----------|
## Column Total | 36179 | 4590 | 40769 |
## -------------|-----------|-----------|-----------|
##
##
This feature is certainly not usable. Only 3 individuals answered “yes” to the question “Do you have credit in default?”. People either answered “no” (79.3%) or didn’t reply at all (20.7%), which gives zero information in our case. This variable is removed from the dataset.
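Dropping the variable is a one-liner (a sketch):

```r
# default is almost constant: remove it
bank_data = bank_data %>%
  select(-default)
```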
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40769
##
##
## | y
## housing | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## no | 16416 | 2003 | 18419 |
## | 0.891 | 0.109 | 0.452 |
## -------------|-----------|-----------|-----------|
## unknown | 877 | 107 | 984 |
## | 0.891 | 0.109 | 0.024 |
## -------------|-----------|-----------|-----------|
## yes | 18886 | 2480 | 21366 |
## | 0.884 | 0.116 | 0.524 |
## -------------|-----------|-----------|-----------|
## Column Total | 36179 | 4590 | 40769 |
## -------------|-----------|-----------|-----------|
##
##
##
## Pearson's Chi-squared test
##
## data: bank_data$housing and bank_data$y
## X-squared = 5.4627, df = 2, p-value = 0.06513
The p-value associated with the chi-squared test equals 0.065, which is higher than the 0.05 threshold. So, at the 95% confidence level, we find no significant association between the dependent variable y and the housing feature. We’re removing it from the dataset.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40769
##
##
## | y
## loan | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## no | 29799 | 3806 | 33605 |
## | 0.887 | 0.113 | 0.824 |
## -------------|-----------|-----------|-----------|
## unknown | 877 | 107 | 984 |
## | 0.891 | 0.109 | 0.024 |
## -------------|-----------|-----------|-----------|
## yes | 5503 | 677 | 6180 |
## | 0.890 | 0.110 | 0.152 |
## -------------|-----------|-----------|-----------|
## Column Total | 36179 | 4590 | 40769 |
## -------------|-----------|-----------|-----------|
##
##
##
## Pearson's Chi-squared test
##
## data: bank_data$loan and bank_data$y
## X-squared = 0.86841, df = 2, p-value = 0.6478
The p-value associated with the chi-squared test equals 0.648, far above any conventional threshold. So there’s no significant association between the dependent variable y and the loan feature either. We’re also removing it from the dataset.
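For reference, the tests and the removals can be sketched with base R’s `chisq.test` and dplyr (assuming the column names above):

```r
# chi-squared tests of independence between y and each feature
chisq.test(bank_data$housing, bank_data$y)
chisq.test(bank_data$loan, bank_data$y)

# neither association is significant: drop both features
bank_data = bank_data %>%
  select(-housing, -loan)
```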
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40769
##
##
## | y
## contact | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## cellular | 22098 | 3815 | 25913 |
## | 0.853 | 0.147 | 0.636 |
## -------------|-----------|-----------|-----------|
## telephone | 14081 | 775 | 14856 |
## | 0.948 | 0.052 | 0.364 |
## -------------|-----------|-----------|-----------|
## Column Total | 36179 | 4590 | 40769 |
## -------------|-----------|-----------|-----------|
##
##
This feature is really interesting: 14.7% of cellular responders subscribed to a term deposit, while only 5.2% of telephone (landline) responders did. I can’t see any obvious reason to explain this.
Recoding levels makes things easier.
month_recode = c("jan" = "(01)jan",
"feb" = "(02)feb",
"mar" = "(03)mar",
"apr" = "(04)apr",
"may" = "(05)may",
"jun" = "(06)jun",
"jul" = "(07)jul",
"aug" = "(08)aug",
"sep" = "(09)sep",
"oct" = "(10)oct",
"nov" = "(11)nov",
"dec" = "(12)dec")
bank_data = bank_data %>%
  mutate(month = recode(month, !!!month_recode))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40769
##
##
## | y
## month | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## (03)mar | 267 | 274 | 541 |
## | 0.494 | 0.506 | 0.013 |
## -------------|-----------|-----------|-----------|
## (04)apr | 2082 | 536 | 2618 |
## | 0.795 | 0.205 | 0.064 |
## -------------|-----------|-----------|-----------|
## (05)may | 12734 | 882 | 13616 |
## | 0.935 | 0.065 | 0.334 |
## -------------|-----------|-----------|-----------|
## (06)jun | 4697 | 548 | 5245 |
## | 0.896 | 0.104 | 0.129 |
## -------------|-----------|-----------|-----------|
## (07)jul | 6471 | 642 | 7113 |
## | 0.910 | 0.090 | 0.174 |
## -------------|-----------|-----------|-----------|
## (08)aug | 5459 | 644 | 6103 |
## | 0.894 | 0.106 | 0.150 |
## -------------|-----------|-----------|-----------|
## (09)sep | 309 | 253 | 562 |
## | 0.550 | 0.450 | 0.014 |
## -------------|-----------|-----------|-----------|
## (10)oct | 396 | 311 | 707 |
## | 0.560 | 0.440 | 0.017 |
## -------------|-----------|-----------|-----------|
## (11)nov | 3672 | 412 | 4084 |
## | 0.899 | 0.101 | 0.100 |
## -------------|-----------|-----------|-----------|
## (12)dec | 92 | 88 | 180 |
## | 0.511 | 0.489 | 0.004 |
## -------------|-----------|-----------|-----------|
## Column Total | 36179 | 4590 | 40769 |
## -------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
aes(x = month, y = ..count../nrow(bank_data), fill = y) +
geom_bar() +
  ylab("relative frequency")
First of all, we can notice that no contact was made during January or February. The highest spike occurs in May, with 33.4% of all contacts, but that month has the worst ratio of subscribers among people contacted (6.5%). Every month with a very low contact frequency (March, September, October and December) shows very good results (between 44% and 51% of subscribers). Except maybe for December, there are enough observations to conclude this isn’t pure luck, so this feature will probably be very important in the models.
For the same reason as before, we’re recoding the days of the week.
day_recode = c("mon" = "(01)mon",
"tue" = "(02)tue",
"wed" = "(03)wed",
"thu" = "(04)thu",
"fri" = "(05)fri")
bank_data = bank_data %>%
  mutate(day_of_week = recode(day_of_week, !!!day_recode))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40769
##
##
## | y
## day_of_week | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## (01)mon | 7578 | 841 | 8419 |
## | 0.900 | 0.100 | 0.207 |
## -------------|-----------|-----------|-----------|
## (02)tue | 7056 | 945 | 8001 |
## | 0.882 | 0.118 | 0.196 |
## -------------|-----------|-----------|-----------|
## (03)wed | 7116 | 934 | 8050 |
## | 0.884 | 0.116 | 0.197 |
## -------------|-----------|-----------|-----------|
## (04)thu | 7493 | 1031 | 8524 |
## | 0.879 | 0.121 | 0.209 |
## -------------|-----------|-----------|-----------|
## (05)fri | 6936 | 839 | 7775 |
## | 0.892 | 0.108 | 0.191 |
## -------------|-----------|-----------|-----------|
## Column Total | 36179 | 4590 | 40769 |
## -------------|-----------|-----------|-----------|
##
##
No calls are made at weekends. While calls are evenly distributed across the weekdays, Thursdays tend to show better results (12.1% of the calls made that day led to a subscription), unlike Mondays with only 10.0% of successful calls. However, those differences are small, which suggests this feature isn’t that important. It would have been interesting to see how responders react to weekend calls.
Since the goal is to find the candidates with the best odds of subscribing to a term deposit, the call duration cannot be known before the call is made. This feature is therefore removed, as recommended by the data documentation.
bank_data %>%
ggplot() +
aes(x = campaign) +
geom_bar() +
facet_grid(y ~ .,
scales = "free_y") +
  scale_x_continuous(breaks = seq(0, 50, 5))
Calling the same person more than ten times during a single marketing campaign seems excessive. We’ll treat those cases as outliers, even if marketing harassment is a real thing. In any case, the chart shows that harassment isn’t working at all.
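A sketch of that filtering (the crosstables that follow are based on 39,912 observations, consistent with this cap):

```r
# treat more than 10 calls per campaign as outliers and drop those rows
bank_data = bank_data %>%
  filter(campaign <= 10)
```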
bank_data %>%
ggplot() +
aes(x = campaign) +
geom_bar() +
facet_grid(y ~ .,
scales = "free_y") +
  scale_x_continuous(breaks = seq(0, 10, 1))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 39912
##
##
## | y
## campaign | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 1 | 15176 | 2268 | 17444 |
## | 0.870 | 0.130 | 0.437 |
## -------------|-----------|-----------|-----------|
## 2 | 9280 | 1205 | 10485 |
## | 0.885 | 0.115 | 0.263 |
## -------------|-----------|-----------|-----------|
## 3 | 4722 | 569 | 5291 |
## | 0.892 | 0.108 | 0.133 |
## -------------|-----------|-----------|-----------|
## 4 | 2380 | 246 | 2626 |
## | 0.906 | 0.094 | 0.066 |
## -------------|-----------|-----------|-----------|
## 5 | 1466 | 120 | 1586 |
## | 0.924 | 0.076 | 0.040 |
## -------------|-----------|-----------|-----------|
## 6 | 892 | 74 | 966 |
## | 0.923 | 0.077 | 0.024 |
## -------------|-----------|-----------|-----------|
## 7 | 583 | 38 | 621 |
## | 0.939 | 0.061 | 0.016 |
## -------------|-----------|-----------|-----------|
## 8 | 378 | 16 | 394 |
## | 0.959 | 0.041 | 0.010 |
## -------------|-----------|-----------|-----------|
## 9 | 260 | 17 | 277 |
## | 0.939 | 0.061 | 0.007 |
## -------------|-----------|-----------|-----------|
## 10 | 211 | 11 | 222 |
## | 0.950 | 0.050 | 0.006 |
## -------------|-----------|-----------|-----------|
## Column Total | 35348 | 4564 | 39912 |
## -------------|-----------|-----------|-----------|
##
##
Let’s extract the row proportions when y = 1.
prop_row %>%
ggplot() +
aes(x = x,
y = Freq) +
geom_point() +
geom_hline(yintercept = 0.085,
             col = "red")
There’s no particular pattern across the different values of campaign. We’d lose a lot of information by binning this variable, so it’ll be kept as a categorical variable with 10 levels.
All the variables are kept as character or numeric for now; we’ll change them to factors later if needed.
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 15 25 59 431 116 46 403 60 17 63 52 27
## 12 13 14 15 16 17 18 19 20 21 22 25
## 58 35 19 24 10 8 7 3 1 2 3 1
## 26 27 999
## 1 1 38425
This is the number of days since the client was last contacted during a previous campaign. A value of 999 means the client wasn’t previously contacted. Let’s make a dummy variable out of it.
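A sketch of that dummy (the name `pdays_dummy` matches the crosstable below; dropping the original column is an assumption):

```r
# pdays_dummy = 1 if the client was contacted in a previous campaign, 0 otherwise
bank_data = bank_data %>%
  mutate(pdays_dummy = if_else(pdays == 999, "0", "1")) %>%
  select(-pdays)
```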
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 39912
##
##
## | y
## pdays_dummy | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 34817 | 3608 | 38425 |
## | 0.906 | 0.094 | 0.963 |
## -------------|-----------|-----------|-----------|
## 1 | 531 | 956 | 1487 |
## | 0.357 | 0.643 | 0.037 |
## -------------|-----------|-----------|-----------|
## Column Total | 35348 | 4564 | 39912 |
## -------------|-----------|-----------|-----------|
##
##
Recontacting a client after a previous campaign seems to greatly increase the odds of subscription.
##
## 0 1 2 3 4 5 6 7
## 34353 4513 738 214 70 18 5 1
This is the number of contacts performed before this campaign and for each client.
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(previous), fill = y)) +
mosaic_theme +
xlab("Previous") +
  ylab(NULL)
If we made a dummy out of this, indicating whether the client was contacted in a previous campaign (1, 2, …) or not (0), it would carry exactly the same information as pdays_dummy. But we can’t keep this variable without binning either, because some levels have far too few observations. The best we can do is to reduce it to 3 levels.
bank_data = bank_data %>%
  mutate(previous = if_else(previous >= 2, "2+", if_else(previous == 1, "1", "0")))
This variable will still be highly correlated with pdays_dummy, since someone can be contacted more than once in the same campaign. It won’t be surprising if we end up having to drop one of them.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 39912
##
##
## | y
## previous | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 31268 | 3085 | 34353 |
## | 0.910 | 0.090 | 0.861 |
## -------------|-----------|-----------|-----------|
## 1 | 3556 | 957 | 4513 |
## | 0.788 | 0.212 | 0.113 |
## -------------|-----------|-----------|-----------|
## 2+ | 524 | 522 | 1046 |
## | 0.501 | 0.499 | 0.026 |
## -------------|-----------|-----------|-----------|
## Column Total | 35348 | 4564 | 39912 |
## -------------|-----------|-----------|-----------|
##
##
As the analysis of the pdays_dummy variable has shown, recontacting someone increases the odds. Can we conclude that long-term harassment works, unlike short-term harassment?
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 39912
##
##
## | y
## poutcome | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## failure | 3616 | 595 | 4211 |
## | 0.859 | 0.141 | 0.106 |
## -------------|-----------|-----------|-----------|
## nonexistent | 31268 | 3085 | 34353 |
## | 0.910 | 0.090 | 0.861 |
## -------------|-----------|-----------|-----------|
## success | 464 | 884 | 1348 |
## | 0.344 | 0.656 | 0.034 |
## -------------|-----------|-----------|-----------|
## Column Total | 35348 | 4564 | 39912 |
## -------------|-----------|-----------|-----------|
##
##
65.6% of the clients who subscribed to a term deposit after a previous campaign accepted to do so again. Even clients who refused before remain more likely to subscribe (14.1%) than clients who were never contacted (9.0%). So even when the previous campaign was a failure, recontacting clients pays off.
Only the social and economic context attributes remain. Since these indicators are strongly correlated with one another, we'll study them in pairs (bivariate analysis).
These five continuous variables are national social and economic indicators, so they are expected to be highly correlated. Let's compute the correlation matrix.
bank_data %>%
select(emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed) %>%
cor() %>%
corrplot(method = "number",
type = "upper",
tl.cex = 0.8,
tl.srt = 45,
tl.col = "black")As expected, three pairs show a correlation coefficient higher than 0.90 which is way too much. Our indicators are too correlated and share redundant information. Let’s figure out which variable(s) can be removed to lighten this correlation matrix.
gg_emp.var.rate = fun_gg_freq("emp.var.rate")
gg_cons.price.idx = fun_gg_freq("cons.price.idx")
gg_cons.conf.idx = fun_gg_freq("cons.conf.idx")
gg_euribor3m = fun_gg_freq("euribor3m")
gg_nr.employed = fun_gg_freq("nr.employed")
plot_grid(gg_emp.var.rate + theme(legend.position = "none") + ylab("Frequency"),
gg_cons.price.idx + theme(legend.position = "none"),
gg_cons.conf.idx + theme(legend.position = "none"),
gg_euribor3m + theme(legend.position = "none"),
gg_nr.employed + theme(legend.position = "none"),
get_legend(gg_cons.conf.idx),
align = "vh")emp.var.rate adds little information of its own. We remove it to soften the correlations among these five variables.
bank_data %>%
select(cons.price.idx, cons.conf.idx, euribor3m, nr.employed) %>%
cor() %>%
corrplot(method = "number",
type = "full",
tl.cex = 0.8,
tl.srt = 45,
tl.col = "black")Even though euribor3m and nr.employed remain highly correlated (0.94), we keep both features. This correlation is spurious: bank size (number of employees) does not react to the Euribor rate.
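Pruning redundant features by eye works for five variables, but the check can also be automated. A dependency-free sketch (on a small synthetic example) that flags every pair of columns whose absolute correlation exceeds a threshold, similar in spirit to what caret's findCorrelation does:

```r
# Flag pairs of columns with |correlation| above a threshold
high_cor_pairs <- function(df, threshold = 0.90) {
  cm <- cor(df)
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             cor  = round(cm[idx], 3))
}

# Synthetic example: x2 is almost a copy of x1, x3 is independent
set.seed(42)
x1 <- rnorm(500)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(500, sd = 0.1),
                  x3 = rnorm(500))
high_cor_pairs(toy)   # flags the (x1, x2) pair only
```

Unlike caret::findCorrelation, this sketch only reports the offending pairs; deciding which member of each pair to drop still requires judgment, as with emp.var.rate above.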
Let’s continue the bivariate analysis by crossing our features with our predicted variable.
First, we'll separate character features from non-character features (numeric or factor) so that each gets the appropriate test: ANOVA for numeric features and the chi-squared test (via the Cramer's V coefficient) for discrete variables.
bank_data_x_dbl = bank_data %>%
select_if(~{is.double(.) | is.factor(.)})
bank_data_x_chr = bank_data %>%
select_if(~is.character(.))Let's start with the ANOVAs.
## Df Sum Sq Mean Sq F value Pr(>F)
## y 1 238 238.41 724.1 <2e-16 ***
## Residuals 39910 13140 0.33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## y 1 2598 2598.1 120.5 <2e-16 ***
## Residuals 39910 860687 21.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## y 1 11401 11401 4154 <2e-16 ***
## Residuals 39910 109537 3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## y 1 26374821 26374821 5738 <2e-16 ***
## Residuals 39910 183451532 4597
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All four remaining variables show values that are statistically contrasted by the y response, which is encouraging.
The Cramer's V is the standardized value of the chi-squared test statistic and is calculated as follows: \[V = \sqrt{\frac{\chi^2}{n \cdot (\min (r, c) -1)}}\] where \(n\) is the sample size, \(r\) is the number of rows of our contingency table and \(c\) the number of columns. Since one of the two inputs of the contingency table is our response variable, which has only two levels, \(\min (r,c) - 1\) is necessarily equal to 1. So in this case the formula reduces to:
\[V = \sqrt{\frac{\chi^2}{n}}\]
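Before running it over all features in the loop below, the simplified formula can be checked on a toy table. A small helper (continuity correction is disabled so that the chi-squared statistic matches the formula exactly):

```r
# Cramer's V for a feature crossed with a two-level response:
# V = sqrt(chi2 / n), since min(r, c) - 1 = 1
cramers_v_binary <- function(tab) {
  chi2 <- chisq.test(tab, correct = FALSE)$statistic
  unname(sqrt(chi2 / sum(tab)))
}

# Toy example: a 3-level feature against a binary outcome
toy <- as.table(rbind(c(30, 10),
                      c(20, 20),
                      c(10, 30)))
round(cramers_v_binary(toy), 3)   # 0.408, i.e. sqrt(20 / 120)
```

Here every expected cell count is 20, the Pearson statistic works out to 20, and V = sqrt(20/120) ≈ 0.408.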
# one row per character feature: variable name, Cramer's V, chi-squared p-value
cramer = as.data.frame(matrix(NA, ncol(bank_data_x_chr), 3))
for (i in (1:ncol(bank_data_x_chr))){
tab = table(bank_data_x_chr[, i], bank_data$y)
chisq_results = chisq.test(tab)
cramer[i, 1] = names(bank_data_x_chr)[i]
# cramer[i, 2] = round(sqrt(chisq_results$statistic/(nrow(bank_data_x_chr) * (min(dim(tab)) -1))), 3)
cramer[i, 2] = round(sqrt(chisq_results$statistic/(nrow(bank_data_x_chr))), 3)
cramer[i, 3] = signif(chisq_results$p.value, 3)
}
colnames(cramer) = c("variable", "cramerv", "pvalue_chi2")cramer %>%
arrange(-cramerv) %>%
ggplot() +
aes(x = reorder(variable, -cramerv),
y = cramerv,
fill = -cramerv) +
geom_bar(stat = "identity",
show.legend = F) +
xlab("Variable") +
ylab("Cramer's V") +
ggtitle("Cramer's V against the explained response") +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) +
scale_fill_gradient(high = "pink",
low = "darkred")A high V value indicates a strong dependency between the feature and the y variable.
This time, both inputs can have more than two levels, so the general formula is used: \[V = \sqrt{\frac{\chi^2}{n \cdot (\min (r, c) -1)}}\]
cramer = matrix(NA, ncol(bank_data_x_chr), ncol(bank_data_x_chr))
for (i in (1:ncol(bank_data_x_chr))){
for (j in (1:ncol(bank_data_x_chr))){
tab = table(bank_data_x_chr[, i], bank_data_x_chr[, j])
chisq_results = chisq.test(tab)
cramer[i, j] = sqrt(chisq_results$statistic/(nrow(bank_data_x_chr) * (min(dim(tab)) -1)))
}
}
cramer = round(cramer, 3)
colnames(cramer) = colnames(bank_data_x_chr)
rownames(cramer) = colnames(bank_data_x_chr)corrplot(cramer,
method = "shade",
type = "upper",
diag = F,
tl.srt = 45,
tl.col = "black",
tl.cex = 0.6,
addCoef.col = "darkgreen",
addCoefasPercent = T)Apart from pdays_dummy and poutcome, the standardized chi-squared statistic does not show any strong dependency between features. We arbitrarily keep both of those variables.
So far, we've:
Now that our features are selected, we convert every character variable to a factor, which eases interpretation of the logistic regressions. The function below also sets the reference level used for comparisons (reference group).
bank_data$age = fun_reorder_levels(bank_data, "age", "low")
bank_data$job = fun_reorder_levels(bank_data, "job", "unemployed")
bank_data$marital = fun_reorder_levels(bank_data, "marital", "single")
bank_data$education = fun_reorder_levels(bank_data, "education", "basic.4y")
bank_data$contact = fun_reorder_levels(bank_data, "contact", "telephone")
bank_data$month = fun_reorder_levels(bank_data, "month", "(03)mar")
bank_data$day_of_week = fun_reorder_levels(bank_data, "day_of_week", "(01)mon")
bank_data$campaign = fun_reorder_levels(bank_data, "campaign", "1")
bank_data$previous = fun_reorder_levels(bank_data, "previous", "0")
bank_data$poutcome = fun_reorder_levels(bank_data, "poutcome", "nonexistent")
bank_data$pdays_dummy = fun_reorder_levels(bank_data, "pdays_dummy", "0")Our dataset is ready; the only remaining step is splitting the data into training and validation sets. Since we're going to test different sampling methods, the training set will change from one method to the next, but the test set stays the same so that models can be compared fairly. Sampling is done with the caret package, except for SMOTE, which relies on the DMwR package.
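The actual split in this analysis uses caret, but the underlying idea, stratified sampling so that the ~11% positive rate is preserved in both sets, is easy to sketch in base R on a toy response:

```r
# Stratified train/test split: sample a proportion p within each class of y
stratified_split <- function(y, p = 0.8, seed = 1234) {
  set.seed(seed)
  train_idx <- unlist(lapply(split(seq_along(y), y),
                             function(idx) sample(idx, floor(p * length(idx)))))
  sort(train_idx)
}

# Toy response with an imbalanced class, like the bank data
y <- factor(rep(c("0", "1"), times = c(890, 110)))
idx <- stratified_split(y)
prop.table(table(y[idx]))    # class balance in the training set (~11% "1")
prop.table(table(y[-idx]))   # ...and in the held-out set (~11% "1")
```

caret::createDataPartition does the same stratification (plus tie-breaking details) in one call; the point is that a plain random split could, by chance, leave the rare class under-represented in one of the two sets.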
This is the default sampling: no change needs to be made.
Let's start with logistic regression, then decision trees, then random forests (via the ranger package).
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_ns)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2102 -0.3959 -0.3254 -0.2542 2.8461
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 58.3815416 24.7913690 2.355 0.018527 *
## agemid -0.1596297 0.0565249 -2.824 0.004742 **
## agehigh 0.0638445 0.1330406 0.480 0.631308
## jobhousemaid -0.0887968 0.1823224 -0.487 0.626236
## jobservices -0.1204328 0.1411130 -0.853 0.393410
## jobadmin. 0.0564314 0.1264671 0.446 0.655443
## jobblue-collar -0.1505559 0.1326503 -1.135 0.256382
## jobtechnician 0.0413816 0.1319284 0.314 0.753774
## jobretired 0.1857167 0.1593447 1.166 0.243815
## jobmanagement -0.0018819 0.1431125 -0.013 0.989508
## jobself-employed -0.0235987 0.1635393 -0.144 0.885264
## jobentrepreneur 0.0544228 0.1632784 0.333 0.738898
## jobstudent 0.2336432 0.1567528 1.491 0.136087
## maritalmarried 0.0249895 0.0510480 0.490 0.624467
## maritaldivorced 0.0574781 0.0753998 0.762 0.445875
## educationhigh.school 0.0517663 0.0896447 0.577 0.563628
## educationbasic.6y 0.1177217 0.1166024 1.010 0.312687
## educationbasic.9y -0.0137289 0.0924755 -0.148 0.881980
## educationprofessional.course 0.0848380 0.0988428 0.858 0.390720
## educationuniversity.degree 0.0977261 0.0873560 1.119 0.263264
## contactcellular 0.5115313 0.0686044 7.456 8.90e-14 ***
## month(05)may -1.4730158 0.1222787 -12.046 < 2e-16 ***
## month(06)jun -0.5294714 0.1473883 -3.592 0.000328 ***
## month(07)jul -0.6060177 0.1349348 -4.491 7.08e-06 ***
## month(08)aug -0.9763408 0.1346840 -7.249 4.19e-13 ***
## month(10)oct -1.0005521 0.1549178 -6.459 1.06e-10 ***
## month(11)nov -1.1413269 0.1317422 -8.663 < 2e-16 ***
## month(12)dec -0.7347432 0.2157858 -3.405 0.000662 ***
## month(04)apr -0.7560740 0.1332616 -5.674 1.40e-08 ***
## month(09)sep -1.3377130 0.1620985 -8.252 < 2e-16 ***
## day_of_week(02)tue 0.2425016 0.0648050 3.742 0.000183 ***
## day_of_week(03)wed 0.3439843 0.0643322 5.347 8.94e-08 ***
## day_of_week(04)thu 0.2825023 0.0626887 4.506 6.59e-06 ***
## day_of_week(05)fri 0.2342640 0.0650205 3.603 0.000315 ***
## campaign2 -0.0294343 0.0487748 -0.603 0.546194
## campaign3 0.0804110 0.0625806 1.285 0.198821
## campaign4 -0.0050193 0.0868835 -0.058 0.953931
## campaign5 -0.2364392 0.1181480 -2.001 0.045369 *
## campaign6 -0.2176382 0.1514567 -1.437 0.150728
## campaign7 -0.4332961 0.2061754 -2.102 0.035589 *
## campaign8 -0.5788286 0.2818325 -2.054 0.039995 *
## campaign9 -0.2435047 0.3313750 -0.735 0.462442
## campaign10 -0.1347225 0.3359390 -0.401 0.688396
## previous1 0.2579275 0.2251453 1.146 0.251959
## previous2+ 0.1679809 0.2724129 0.617 0.537472
## poutcomefailure -0.7651163 0.2204643 -3.470 0.000520 ***
## poutcomesuccess NA NA NA NA
## cons.price.idx -0.0281799 0.1368159 -0.206 0.836815
## cons.conf.idx 0.0193871 0.0078440 2.472 0.013452 *
## euribor3m -0.0007667 0.1333870 -0.006 0.995414
## nr.employed -0.0110117 0.0025492 -4.320 1.56e-05 ***
## pdays_dummy1 1.1447030 0.2241324 5.107 3.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22707 on 31930 degrees of freedom
## Residual deviance: 17776 on 31880 degrees of freedom
## AIC: 17878
##
## Number of Fisher Scoring iterations: 6
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
measure_train_ns = fun_gg_cutoff(logistic_train_score_ns, bank_train_ns$y,
"acc", "sens")
measure_train_ns +
geom_vline(xintercept = c(0.2, 0.5),
linetype = "dashed")logistic_train_class_ns = fun_cut_predict(logistic_train_score_ns, 0.2)
# matrix
logistic_train_confm_ns = confusionMatrix(logistic_train_class_ns, bank_train_ns$y,
positive = "1")
logistic_train_confm_ns## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25581 1587
## 1 2698 2065
##
## Accuracy : 0.8658
## 95% CI : (0.862, 0.8695)
## No Information Rate : 0.8856
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4151
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.56544
## Specificity : 0.90459
## Pos Pred Value : 0.43355
## Neg Pred Value : 0.94159
## Prevalence : 0.11437
## Detection Rate : 0.06467
## Detection Prevalence : 0.14917
## Balanced Accuracy : 0.73502
##
## 'Positive' Class : 1
##
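fun_cut_predict is a helper defined earlier in the full analysis and not shown in this excerpt; a minimal version consistent with how it is used here, assuming the scores are positive-class probabilities and the response is coded "0"/"1":

```r
# Turn probability scores into class labels at a chosen cutoff,
# with factor levels matching the response coding ("0"/"1")
fun_cut_predict <- function(score, cut) {
  factor(ifelse(score >= cut, "1", "0"), levels = c("0", "1"))
}

fun_cut_predict(c(0.05, 0.19, 0.20, 0.85), cut = 0.2)   # 0 0 1 1
```

Lowering the cutoff from the default 0.5 to 0.2, as above, trades specificity for sensitivity, which matters here because the positive class is rare (~11%).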
measure_test_ns = fun_gg_cutoff(logistic_test_score_ns, bank_test_ns$y,
"acc", "sens")
measure_test_ns +
geom_vline(xintercept = c(0.2, 0.5),
linetype = "dashed")logistic_test_class_ns = fun_cut_predict(logistic_test_score_ns, 0.2)
# matrix
logistic_test_confm_ns = confusionMatrix(logistic_test_class_ns, bank_test_ns$y,
positive = "1")
logistic_test_confm_ns## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6399 420
## 1 670 492
##
## Accuracy : 0.8634
## 95% CI : (0.8557, 0.8709)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3973
##
## Mcnemar's Test P-Value : 4.628e-14
##
## Sensitivity : 0.53947
## Specificity : 0.90522
## Pos Pred Value : 0.42341
## Neg Pred Value : 0.93841
## Prevalence : 0.11427
## Detection Rate : 0.06165
## Detection Prevalence : 0.14560
## Balanced Accuracy : 0.72235
##
## 'Positive' Class : 1
##
logistic_ns_2 = glm(y ~ . - job - marital - education - previous - euribor3m - cons.conf.idx - campaign,
data = bank_train_ns,
family = "binomial")##
## Call:
## glm(formula = y ~ . - job - marital - education - previous -
## euribor3m - cons.conf.idx - campaign, family = "binomial",
## data = bank_train_ns)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1363 -0.3951 -0.3346 -0.2382 2.7988
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 65.9168072 3.8313927 17.204 < 2e-16 ***
## agemid -0.1769233 0.0487356 -3.630 0.000283 ***
## agehigh 0.1785377 0.0950352 1.879 0.060293 .
## contactcellular 0.4059889 0.0610789 6.647 2.99e-11 ***
## month(05)may -1.5308954 0.1143478 -13.388 < 2e-16 ***
## month(06)jun -0.4872326 0.1247922 -3.904 9.45e-05 ***
## month(07)jul -0.5316558 0.1246735 -4.264 2.00e-05 ***
## month(08)aug -0.7590794 0.1211852 -6.264 3.76e-10 ***
## month(10)oct -0.8283920 0.1417583 -5.844 5.11e-09 ***
## month(11)nov -1.0477606 0.1256554 -8.338 < 2e-16 ***
## month(12)dec -0.5671038 0.2090464 -2.713 0.006671 **
## month(04)apr -0.8061365 0.1194677 -6.748 1.50e-11 ***
## month(09)sep -1.1521548 0.1480853 -7.780 7.23e-15 ***
## day_of_week(02)tue 0.2561945 0.0643412 3.982 6.84e-05 ***
## day_of_week(03)wed 0.3487747 0.0640040 5.449 5.06e-08 ***
## day_of_week(04)thu 0.2890088 0.0621679 4.649 3.34e-06 ***
## day_of_week(05)fri 0.2376569 0.0647390 3.671 0.000242 ***
## poutcomefailure -0.5213575 0.0631761 -8.252 < 2e-16 ***
## poutcomesuccess 0.2911045 0.2184294 1.333 0.182625
## cons.price.idx -0.0954891 0.0391932 -2.436 0.014836 *
## nr.employed -0.0113909 0.0003334 -34.164 < 2e-16 ***
## pdays_dummy1 1.1133871 0.2065204 5.391 7.00e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22707 on 31930 degrees of freedom
## Residual deviance: 17845 on 31909 degrees of freedom
## AIC: 17889
##
## Number of Fisher Scoring iterations: 6
measure_train_ns_2 = fun_gg_cutoff(logistic_train_score_ns_2, bank_train_ns$y,
"acc", "sens")
measure_train_ns_2 +
geom_vline(xintercept = c(0.2, 0.5),
linetype = "dashed")logistic_train_class_ns_2 = fun_cut_predict(logistic_train_score_ns_2, 0.2)
# matrix
logistic_train_confm_ns_2 = confusionMatrix(logistic_train_class_ns_2, bank_train_ns$y,
positive = "1")
logistic_train_confm_ns_2## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25535 1571
## 1 2744 2081
##
## Accuracy : 0.8649
## 95% CI : (0.8611, 0.8686)
## No Information Rate : 0.8856
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4148
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.56982
## Specificity : 0.90297
## Pos Pred Value : 0.43130
## Neg Pred Value : 0.94204
## Prevalence : 0.11437
## Detection Rate : 0.06517
## Detection Prevalence : 0.15111
## Balanced Accuracy : 0.73640
##
## 'Positive' Class : 1
##
measure_test_ns_2 = fun_gg_cutoff(logistic_test_score_ns_2, bank_test_ns$y,
"acc", "sens")
measure_test_ns_2 +
geom_vline(xintercept = c(0.2, 0.5),
linetype = "dashed")logistic_test_class_ns_2 = fun_cut_predict(logistic_test_score_ns_2, 0.2)
# matrix
logistic_test_confm_ns_2 = confusionMatrix(logistic_test_class_ns_2, bank_test_ns$y,
positive = "1")
logistic_test_confm_ns_2## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6386 417
## 1 683 495
##
## Accuracy : 0.8622
## 95% CI : (0.8544, 0.8697)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3959
##
## Mcnemar's Test P-Value : 1.349e-15
##
## Sensitivity : 0.54276
## Specificity : 0.90338
## Pos Pred Value : 0.42020
## Neg Pred Value : 0.93870
## Prevalence : 0.11427
## Detection Rate : 0.06202
## Detection Prevalence : 0.14760
## Balanced Accuracy : 0.72307
##
## 'Positive' Class : 1
##
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
cp_best_ns = tree_ns$cptable %>%
as.data.frame %>%
filter(xerror == min(xerror)) %>%
select(CP) %>%
slice(1) %>%
as.numeric()
tree_opt_ns = prune(tree_ns,
cp = cp_best_ns)measure_train_ns = fun_gg_cutoff(tree_opt_train_score_ns, bank_train_ns$y,
"acc", "sens")
measure_train_ns +
geom_vline(xintercept = c(0.25, 0.5),
linetype = "dashed")tree_opt_train_class_ns = fun_cut_predict(tree_opt_train_score_ns, 0.25)
tree_opt_train_confm_ns = confusionMatrix(tree_opt_train_class_ns, bank_train_ns$y,
positive = "1")
tree_opt_train_confm_ns## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 26477 1979
## 1 1802 1673
##
## Accuracy : 0.8816
## 95% CI : (0.878, 0.8851)
## No Information Rate : 0.8856
## P-Value [Acc > NIR] : 0.988329
##
## Kappa : 0.4029
##
## Mcnemar's Test P-Value : 0.004206
##
## Sensitivity : 0.45811
## Specificity : 0.93628
## Pos Pred Value : 0.48144
## Neg Pred Value : 0.93045
## Prevalence : 0.11437
## Detection Rate : 0.05239
## Detection Prevalence : 0.10883
## Balanced Accuracy : 0.69719
##
## 'Positive' Class : 1
##
measure_test_ns = fun_gg_cutoff(tree_opt_test_score_ns, bank_test_ns$y,
"acc", "sens")
measure_test_ns +
geom_vline(xintercept = c(0.25, 0.5),
linetype = "dashed")tree_opt_test_class_ns = fun_cut_predict(tree_opt_test_score_ns, 0.25)
# matrix
tree_opt_test_confm_ns = confusionMatrix(tree_opt_test_class_ns, bank_test_ns$y,
positive = "1")
tree_opt_test_confm_ns## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6619 520
## 1 450 392
##
## Accuracy : 0.8785
## 95% CI : (0.8711, 0.8856)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 0.97953
##
## Kappa : 0.3788
##
## Mcnemar's Test P-Value : 0.02673
##
## Sensitivity : 0.42982
## Specificity : 0.93634
## Pos Pred Value : 0.46556
## Neg Pred Value : 0.92716
## Prevalence : 0.11427
## Detection Rate : 0.04912
## Detection Prevalence : 0.10550
## Balanced Accuracy : 0.68308
##
## 'Positive' Class : 1
##
tree_ns_2 = rpart(y ~ nr.employed + euribor3m + pdays_dummy + poutcome + month,
data = bank_train_ns,
cp = 0.0005)cp_best_ns_2 = tree_ns_2$cptable %>%
as.data.frame %>%
filter(xerror == min(xerror)) %>%
select(CP) %>%
slice(1) %>%
as.numeric()
tree_opt_ns_2 = prune(tree_ns_2,
cp = cp_best_ns_2)measure_train_ns_2 = fun_gg_cutoff(tree_opt_train_score_ns_2, bank_train_ns$y,
"acc", "sens")
measure_train_ns_2 +
geom_vline(xintercept = c(0.3, 0.5),
linetype = "dashed")tree_opt_train_class_ns_2 = fun_cut_predict(tree_opt_train_score_ns_2, 0.3)
tree_opt_train_confm_ns_2 = confusionMatrix(tree_opt_train_class_ns_2, bank_train_ns$y,
positive = "1")
tree_opt_train_confm_ns_2## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25928 1682
## 1 2351 1970
##
## Accuracy : 0.8737
## 95% CI : (0.87, 0.8773)
## No Information Rate : 0.8856
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4226
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5394
## Specificity : 0.9169
## Pos Pred Value : 0.4559
## Neg Pred Value : 0.9391
## Prevalence : 0.1144
## Detection Rate : 0.0617
## Detection Prevalence : 0.1353
## Balanced Accuracy : 0.7281
##
## 'Positive' Class : 1
##
measure_test_ns = fun_gg_cutoff(tree_opt_test_score_ns_2, bank_test_ns$y,
"acc", "sens")
measure_test_ns +
geom_vline(xintercept = c(0.3, 0.5),
linetype = "dashed")tree_opt_test_class_ns_2 = fun_cut_predict(tree_opt_test_score_ns_2, 0.3)
# matrix
tree_opt_test_confm_ns_2 = confusionMatrix(tree_opt_test_class_ns_2, bank_test_ns$y,
positive = "1")
tree_opt_test_confm_ns_2## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6487 447
## 1 582 465
##
## Accuracy : 0.8711
## 95% CI : (0.8635, 0.8783)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4016
##
## Mcnemar's Test P-Value : 2.95e-05
##
## Sensitivity : 0.50987
## Specificity : 0.91767
## Pos Pred Value : 0.44413
## Neg Pred Value : 0.93554
## Prevalence : 0.11427
## Detection Rate : 0.05826
## Detection Prevalence : 0.13119
## Balanced Accuracy : 0.71377
##
## 'Positive' Class : 1
##
rf_ns = ranger(y ~ .,
data = bank_train_ns,
num.trees = 1000,
importance = "impurity",
write.forest = T,
probability = T)## Ranger result
##
## Call:
## ranger(y ~ ., data = bank_train_ns, num.trees = 1000, importance = "impurity", write.forest = T, probability = T)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 31931
## Number of independent variables: 15
## Mtry: 3
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.07761157
measure_train_ns = fun_gg_cutoff(rf_train_score_ns, bank_train_ns$y,
"acc", "sens")
measure_train_ns +
geom_vline(xintercept = c(0.125, 0.5),
linetype = "dashed")rf_train_class_ns = fun_cut_predict(rf_train_score_ns, 0.125)
# matrix
rf_train_confm_ns = confusionMatrix(rf_train_class_ns, bank_train_ns$y,
positive = "1")
rf_train_confm_ns## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24945 858
## 1 3334 2794
##
## Accuracy : 0.8687
## 95% CI : (0.865, 0.8724)
## No Information Rate : 0.8856
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4997
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7651
## Specificity : 0.8821
## Pos Pred Value : 0.4559
## Neg Pred Value : 0.9667
## Prevalence : 0.1144
## Detection Rate : 0.0875
## Detection Prevalence : 0.1919
## Balanced Accuracy : 0.8236
##
## 'Positive' Class : 1
##
measure_test_ns = fun_gg_cutoff(rf_test_score_ns, bank_test_ns$y,
"acc", "sens")
measure_test_ns +
geom_vline(xintercept = c(0.2, 0.5),
linetype = "dashed")rf_test_class_ns = fun_cut_predict(rf_test_score_ns, 0.2)
# matrix
rf_test_confm_ns = confusionMatrix(rf_test_class_ns, bank_test_ns$y,
positive = "1")
rf_test_confm_ns## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6402 402
## 1 667 510
##
## Accuracy : 0.8661
## 95% CI : (0.8584, 0.8735)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4126
##
## Mcnemar's Test P-Value : 6.776e-16
##
## Sensitivity : 0.5592
## Specificity : 0.9056
## Pos Pred Value : 0.4333
## Neg Pred Value : 0.9409
## Prevalence : 0.1143
## Detection Rate : 0.0639
## Detection Prevalence : 0.1475
## Balanced Accuracy : 0.7324
##
## 'Positive' Class : 1
##
score_train_ns = data.frame("logistic complex" = logistic_train_score_ns,
"logistic simple" = logistic_train_score_ns_2,
"tree complex" = tree_opt_train_score_ns,
"tree simple" = tree_opt_train_score_ns_2,
"random forest" = rf_train_score_ns,
"obs" = as.numeric(bank_train_ns$y) - 1) %>%
gather(key = "Method", value = "score", -obs)
roc_train_ns = score_train_ns %>%
ggplot() +
aes(d = obs,
m = score,
color = Method) +
geom_roc() +
ggtitle("Train dataset")
print(roc_train_ns)data.frame("Model" = c("Logistic regression (complex)",
"Logistic regression (simple)",
"Decision tree (complex)",
"Decision tree (simple)",
"Random forest (ranger)"),
"AUC" = c(auc(bank_train_ns$y, logistic_train_score_ns),
auc(bank_train_ns$y, logistic_train_score_ns_2),
auc(bank_train_ns$y, tree_opt_train_score_ns),
auc(bank_train_ns$y, tree_opt_train_score_ns_2),
auc(bank_train_ns$y, rf_train_score_ns)),
"Threshold" = c(0.2, 0.2, 0.25, 0.3, 0.125),
"Accuracy" = c(logistic_train_confm_ns[["overall"]][["Accuracy"]],
logistic_train_confm_ns_2[["overall"]][["Accuracy"]],
tree_opt_train_confm_ns[["overall"]][["Accuracy"]],
tree_opt_train_confm_ns_2[["overall"]][["Accuracy"]],
rf_train_confm_ns[["overall"]][["Accuracy"]]),
"Sensitivity" = c(logistic_train_confm_ns[["byClass"]][["Sensitivity"]],
logistic_train_confm_ns_2[["byClass"]][["Sensitivity"]],
tree_opt_train_confm_ns[["byClass"]][["Sensitivity"]],
tree_opt_train_confm_ns_2[["byClass"]][["Sensitivity"]],
rf_train_confm_ns[["byClass"]][["Sensitivity"]]),
stringsAsFactors = F)## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7939029 0.200 0.8658044 0.5654436
## 2 Logistic regression (simple) 0.7927532 0.200 0.8648649 0.5698248
## 3 Decision tree (complex) 0.7112919 0.250 0.8815884 0.4581051
## 4 Decision tree (simple) 0.7587257 0.300 0.8736964 0.5394304
## 5 Random forest (ranger) 0.9205616 0.125 0.8687169 0.7650602
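The AUC values above come from pROC::auc. As a sanity check, AUC is identical to the normalized Mann-Whitney rank-sum statistic (the probability that a random positive scores above a random negative), which can be computed directly in base R:

```r
# AUC via the rank-sum identity; average ranks handle ties
auc_rank <- function(score, label) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  r <- rank(c(pos, neg))
  (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}

# Toy check: well-separated score distributions give an AUC close to 1
set.seed(1)
lab <- rep(c(0, 1), each = 100)
sc  <- c(rnorm(100, 0), rnorm(100, 2))
auc_rank(sc, lab)   # close to 1
```

This identity also explains why AUC is invariant to any monotone rescaling of the scores: only their ranks matter.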
score_test_ns = data.frame("logistic complex" = logistic_test_score_ns,
"logistic simple" = logistic_test_score_ns_2,
"tree complex" = tree_opt_test_score_ns,
"tree simple" = tree_opt_test_score_ns_2,
"random forest" = rf_test_score_ns,
"obs" = as.numeric(bank_test_ns$y) - 1) %>%
gather(key = "Method", value = "score", -obs)
roc_test_ns = score_test_ns %>%
ggplot() +
aes(d = obs,
m = score,
color = Method) +
geom_roc() +
ggtitle("Validation dataset")
roc_test_nsdf_final_ns = data.frame("Model" = c("Logistic regression (complex)",
"Logistic regression (simple)",
"Decision tree (complex)",
"Decision tree (simple)",
"Random forest (ranger)"),
"AUC" = c(auc(bank_test_ns$y, logistic_test_score_ns),
auc(bank_test_ns$y, logistic_test_score_ns_2),
auc(bank_test_ns$y, tree_opt_test_score_ns),
auc(bank_test_ns$y, tree_opt_test_score_ns_2),
auc(bank_test_ns$y, rf_test_score_ns)),
"Threshold" = c(0.2, 0.2, 0.25, 0.3, 0.125),
"Accuracy" = c(logistic_test_confm_ns[["overall"]][["Accuracy"]],
logistic_test_confm_ns_2[["overall"]][["Accuracy"]],
tree_opt_test_confm_ns[["overall"]][["Accuracy"]],
tree_opt_test_confm_ns_2[["overall"]][["Accuracy"]],
rf_test_confm_ns[["overall"]][["Accuracy"]]),
"Sensitivity" = c(logistic_test_confm_ns[["byClass"]][["Sensitivity"]],
logistic_test_confm_ns_2[["byClass"]][["Sensitivity"]],
tree_opt_test_confm_ns[["byClass"]][["Sensitivity"]],
tree_opt_test_confm_ns_2[["byClass"]][["Sensitivity"]],
rf_test_confm_ns[["byClass"]][["Sensitivity"]]),
stringsAsFactors = F)
df_final_ns## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7749897 0.200 0.8634256 0.5394737
## 2 Logistic regression (simple) 0.7756386 0.200 0.8621727 0.5427632
## 3 Decision tree (complex) 0.6926397 0.250 0.8784613 0.4298246
## 4 Decision tree (simple) 0.7396911 0.300 0.8710688 0.5098684
## 5 Random forest (ranger) 0.7861381 0.125 0.8660569 0.5592105
set.seed(1234)
bank_train_ds = downSample(x = bank_train %>% select(-y),
y = bank_train$y,
yname = "y")
bank_test_ds = bank_test##
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_ds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7700 -0.8696 -0.1514 0.8139 2.0738
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 88.912763 37.071219 2.398 0.016465 *
## agemid -0.185724 0.077777 -2.388 0.016944 *
## agehigh -0.342582 0.211457 -1.620 0.105210
## jobhousemaid 0.090610 0.260078 0.348 0.727544
## jobservices 0.066291 0.202461 0.327 0.743344
## jobadmin. 0.167694 0.186331 0.900 0.368130
## jobblue-collar 0.018984 0.192145 0.099 0.921297
## jobtechnician 0.232040 0.193374 1.200 0.230157
## jobretired 0.637648 0.236014 2.702 0.006898 **
## jobmanagement 0.164375 0.208252 0.789 0.429931
## jobself-employed 0.182019 0.233324 0.780 0.435323
## jobentrepreneur 0.402299 0.233001 1.727 0.084240 .
## jobstudent 0.458142 0.245423 1.867 0.061937 .
## maritalmarried 0.074860 0.069854 1.072 0.283870
## maritaldivorced 0.088578 0.102645 0.863 0.388165
## educationhigh.school 0.076348 0.123145 0.620 0.535267
## educationbasic.6y 0.158329 0.152674 1.037 0.299718
## educationbasic.9y 0.119823 0.123071 0.974 0.330252
## educationprofessional.course 0.077591 0.137371 0.565 0.572193
## educationuniversity.degree 0.166033 0.121409 1.368 0.171452
## contactcellular 0.334683 0.101432 3.300 0.000968 ***
## month(05)may -1.468083 0.221400 -6.631 3.34e-11 ***
## month(06)jun -0.422951 0.277630 -1.523 0.127650
## month(07)jul -0.479768 0.243097 -1.974 0.048431 *
## month(08)aug -0.786387 0.251893 -3.122 0.001797 **
## month(10)oct -0.377485 0.291274 -1.296 0.194982
## month(11)nov -1.160957 0.229159 -5.066 4.06e-07 ***
## month(12)dec -0.364799 0.446958 -0.816 0.414397
## month(04)apr -0.701019 0.236928 -2.959 0.003089 **
## month(09)sep -1.230486 0.298092 -4.128 3.66e-05 ***
## day_of_week(02)tue 0.211085 0.088848 2.376 0.017511 *
## day_of_week(03)wed 0.327703 0.087542 3.743 0.000182 ***
## day_of_week(04)thu 0.258029 0.086524 2.982 0.002862 **
## day_of_week(05)fri 0.215252 0.088210 2.440 0.014678 *
## campaign2 -0.039130 0.068827 -0.569 0.569673
## campaign3 0.019399 0.085767 0.226 0.821059
## campaign4 -0.008098 0.114729 -0.071 0.943727
## campaign5 -0.282353 0.153212 -1.843 0.065344 .
## campaign6 -0.147349 0.193541 -0.761 0.446458
## campaign7 -0.269657 0.256845 -1.050 0.293772
## campaign8 -0.395457 0.341930 -1.157 0.247458
## campaign9 -0.361462 0.377399 -0.958 0.338178
## campaign10 -0.412008 0.402783 -1.023 0.306354
## previous1 -0.140060 0.513467 -0.273 0.785028
## previous2+ -0.279052 0.589601 -0.473 0.636007
## poutcomefailure -0.312718 0.509859 -0.613 0.539650
## poutcomesuccess NA NA NA NA
## cons.price.idx -0.221133 0.206868 -1.069 0.285089
## cons.conf.idx 0.002998 0.013625 0.220 0.825822
## euribor3m 0.119074 0.197985 0.601 0.547554
## nr.employed -0.013277 0.003836 -3.461 0.000538 ***
## pdays_dummy1 1.465256 0.509746 2.874 0.004047 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10125.5 on 7303 degrees of freedom
## Residual deviance: 7813.7 on 7253 degrees of freedom
## AIC: 7915.7
##
## Number of Fisher Scoring iterations: 5
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
measure_train_ds = fun_gg_cutoff(logistic_train_score_ds, bank_train_ds$y,
"acc", "sens")
measure_train_ds +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")logistic_train_class_ds = fun_cut_predict(logistic_train_score_ds, 0.5)
# matrix
logistic_train_confm_ds = confusionMatrix(logistic_train_class_ds, bank_train_ds$y,
positive = "1")
logistic_train_confm_ds## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3081 1311
## 1 571 2341
##
## Accuracy : 0.7423
## 95% CI : (0.7321, 0.7523)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4847
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6410
## Specificity : 0.8436
## Pos Pred Value : 0.8039
## Neg Pred Value : 0.7015
## Prevalence : 0.5000
## Detection Rate : 0.3205
## Detection Prevalence : 0.3987
## Balanced Accuracy : 0.7423
##
## 'Positive' Class : 1
##
measure_test_ds = fun_gg_cutoff(logistic_test_score_ds, bank_test_ds$y,
"acc", "sens")
measure_test_ds +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")logistic_test_class_ds = fun_cut_predict(logistic_test_score_ds, 0.5)
# matrix
logistic_test_confm_ds = confusionMatrix(logistic_test_class_ds, bank_test_ds$y,
positive = "1")
logistic_test_confm_ds## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5942 365
## 1 1127 547
##
## Accuracy : 0.8131
## 95% CI : (0.8043, 0.8216)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3229
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.59978
## Specificity : 0.84057
## Pos Pred Value : 0.32676
## Neg Pred Value : 0.94213
## Prevalence : 0.11427
## Detection Rate : 0.06854
## Detection Prevalence : 0.20975
## Balanced Accuracy : 0.72018
##
## 'Positive' Class : 1
##
logistic_ds_2 = glm(y ~ . - marital - education - previous - campaign - cons.conf.idx - euribor3m,
data = bank_train_ds,
family = "binomial")
##
## Call:
## glm(formula = y ~ . - marital - education - previous - campaign -
## cons.conf.idx - euribor3m, family = "binomial", data = bank_train_ds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8595 -0.8766 -0.1776 0.8226 2.0185
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 68.4737588 6.2919393 10.883 < 2e-16 ***
## agemid -0.1580900 0.0731351 -2.162 0.030648 *
## agehigh -0.2977445 0.2061601 -1.444 0.148672
## jobhousemaid 0.0695177 0.2576437 0.270 0.787298
## jobservices 0.0438934 0.1996959 0.220 0.826026
## jobadmin. 0.1754483 0.1837791 0.955 0.339745
## jobblue-collar -0.0003024 0.1876683 -0.002 0.998714
## jobtechnician 0.2148402 0.1889763 1.137 0.255595
## jobretired 0.6037687 0.2340050 2.580 0.009876 **
## jobmanagement 0.2052460 0.2044126 1.004 0.315341
## jobself-employed 0.1928434 0.2314953 0.833 0.404826
## jobentrepreneur 0.4171944 0.2316734 1.801 0.071737 .
## jobstudent 0.4354593 0.2434884 1.788 0.073708 .
## contactcellular 0.2927957 0.0891209 3.285 0.001018 **
## month(05)may -1.4826630 0.2097093 -7.070 1.55e-12 ***
## month(06)jun -0.4883279 0.2248794 -2.172 0.029893 *
## month(07)jul -0.4886509 0.2236833 -2.185 0.028921 *
## month(08)aug -0.7177999 0.2211024 -3.246 0.001168 **
## month(10)oct -0.2655819 0.2801001 -0.948 0.343044
## month(11)nov -1.0934022 0.2231401 -4.900 9.58e-07 ***
## month(12)dec -0.2698585 0.4368450 -0.618 0.536744
## month(04)apr -0.7460241 0.2190123 -3.406 0.000658 ***
## month(09)sep -1.1017576 0.2824608 -3.901 9.60e-05 ***
## day_of_week(02)tue 0.2218305 0.0884197 2.509 0.012113 *
## day_of_week(03)wed 0.3374821 0.0871291 3.873 0.000107 ***
## day_of_week(04)thu 0.2638747 0.0859392 3.070 0.002137 **
## day_of_week(05)fri 0.2201694 0.0878446 2.506 0.012198 *
## poutcomefailure -0.4705050 0.0926822 -5.077 3.84e-07 ***
## poutcomesuccess -0.0844897 0.5020621 -0.168 0.866359
## cons.price.idx -0.1200751 0.0697635 -1.721 0.085219 .
## nr.employed -0.0110661 0.0005790 -19.114 < 2e-16 ***
## pdays_dummy1 1.3719416 0.4816075 2.849 0.004390 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10125.5 on 7303 degrees of freedom
## Residual deviance: 7828.1 on 7272 degrees of freedom
## AIC: 7892.1
##
## Number of Fisher Scoring iterations: 5
measure_train_ds_2 = fun_gg_cutoff(logistic_train_score_ds_2, bank_train_ds$y,
"acc", "sens")
measure_train_ds_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
logistic_train_class_ds_2 = fun_cut_predict(logistic_train_score_ds_2, 0.5)
# matrix
logistic_train_confm_ds_2 = confusionMatrix(logistic_train_class_ds_2, bank_train_ds$y,
positive = "1")
logistic_train_confm_ds_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3067 1311
## 1 585 2341
##
## Accuracy : 0.7404
## 95% CI : (0.7302, 0.7504)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4808
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6410
## Specificity : 0.8398
## Pos Pred Value : 0.8001
## Neg Pred Value : 0.7005
## Prevalence : 0.5000
## Detection Rate : 0.3205
## Detection Prevalence : 0.4006
## Balanced Accuracy : 0.7404
##
## 'Positive' Class : 1
##
measure_test_ds_2 = fun_gg_cutoff(logistic_test_score_ds_2, bank_test_ds$y,
"acc", "sens")
measure_test_ds_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
logistic_test_class_ds_2 = fun_cut_predict(logistic_test_score_ds_2, 0.5)
# matrix
logistic_test_confm_ds_2 = confusionMatrix(logistic_test_class_ds_2, bank_test_ds$y,
positive = "1")
logistic_test_confm_ds_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5915 359
## 1 1154 553
##
## Accuracy : 0.8104
## 95% CI : (0.8016, 0.819)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3212
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60636
## Specificity : 0.83675
## Pos Pred Value : 0.32396
## Neg Pred Value : 0.94278
## Prevalence : 0.11427
## Detection Rate : 0.06929
## Detection Prevalence : 0.21388
## Balanced Accuracy : 0.72156
##
## 'Positive' Class : 1
##
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
cp_best_ds = tree_ds$cptable %>%
as.data.frame %>%
filter(xerror == min(xerror)) %>%
select(CP) %>%
slice(1) %>%
as.numeric()
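The pipeline above selects the complexity parameter with the smallest cross-validated error (xerror). A common, slightly more conservative alternative is the one-standard-error rule, which prefers the simplest tree whose error is within one SE of that minimum. A sketch, reusing tree_ds from above (the names cp_1se and tree_1se are illustrative):

cpt = as.data.frame(tree_ds$cptable)
thr = min(cpt$xerror) + cpt$xstd[which.min(cpt$xerror)]
cp_1se = cpt$CP[which(cpt$xerror <= thr)[1]]  # cptable rows go from simplest to most complex
tree_1se = prune(tree_ds, cp = cp_1se)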
tree_opt_ds = prune(tree_ds,
cp = cp_best_ds)
measure_train_ds = fun_gg_cutoff(tree_opt_train_score_ds, bank_train_ds$y,
"acc", "sens")
measure_train_ds +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_train_class_ds = fun_cut_predict(tree_opt_train_score_ds, 0.5)
tree_opt_train_confm_ds = confusionMatrix(tree_opt_train_class_ds, bank_train_ds$y,
positive = "1")
tree_opt_train_confm_ds
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3131 1232
## 1 521 2420
##
## Accuracy : 0.76
## 95% CI : (0.75, 0.7698)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.52
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6627
## Specificity : 0.8573
## Pos Pred Value : 0.8228
## Neg Pred Value : 0.7176
## Prevalence : 0.5000
## Detection Rate : 0.3313
## Detection Prevalence : 0.4027
## Balanced Accuracy : 0.7600
##
## 'Positive' Class : 1
##
measure_test_ds = fun_gg_cutoff(tree_opt_test_score_ds, bank_test_ds$y,
"acc", "sens")
measure_test_ds +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_test_class_ds = fun_cut_predict(tree_opt_test_score_ds, 0.5)
# matrix
tree_opt_test_confm_ds = confusionMatrix(tree_opt_test_class_ds, bank_test_ds$y,
positive = "1")
tree_opt_test_confm_ds
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5965 350
## 1 1104 562
##
## Accuracy : 0.8178
## 95% CI : (0.8092, 0.8262)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3383
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.61623
## Specificity : 0.84383
## Pos Pred Value : 0.33733
## Neg Pred Value : 0.94458
## Prevalence : 0.11427
## Detection Rate : 0.07042
## Detection Prevalence : 0.20875
## Balanced Accuracy : 0.73003
##
## 'Positive' Class : 1
##
tree_ds_2 = rpart(y ~ nr.employed + euribor3m + cons.conf.idx + pdays_dummy + poutcome + cons.price.idx + month,
data = bank_train_ds,
cp = 0.0005)
cp_best_ds_2 = tree_ds_2$cptable %>%
as.data.frame %>%
filter(xerror == min(xerror)) %>%
select(CP) %>%
slice(1) %>%
as.numeric()
tree_opt_ds_2 = prune(tree_ds_2,
cp = cp_best_ds_2)
measure_train_ds_2 = fun_gg_cutoff(tree_opt_train_score_ds_2, bank_train_ds$y,
"acc", "sens")
measure_train_ds_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_train_class_ds_2 = fun_cut_predict(tree_opt_train_score_ds_2, 0.5)
tree_opt_train_confm_ds_2 = confusionMatrix(tree_opt_train_class_ds_2, bank_train_ds$y,
positive = "1")
tree_opt_train_confm_ds_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3068 1209
## 1 584 2443
##
## Accuracy : 0.7545
## 95% CI : (0.7445, 0.7644)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.509
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6689
## Specificity : 0.8401
## Pos Pred Value : 0.8071
## Neg Pred Value : 0.7173
## Prevalence : 0.5000
## Detection Rate : 0.3345
## Detection Prevalence : 0.4144
## Balanced Accuracy : 0.7545
##
## 'Positive' Class : 1
##
measure_test_ds_2 = fun_gg_cutoff(tree_opt_test_score_ds_2, bank_test_ds$y,
"acc", "sens")
measure_test_ds_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_test_class_ds_2 = fun_cut_predict(tree_opt_test_score_ds_2, 0.5)
# matrix
tree_opt_test_confm_ds_2 = confusionMatrix(tree_opt_test_class_ds_2, bank_test_ds$y,
positive = "1")
tree_opt_test_confm_ds_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5877 336
## 1 1192 576
##
## Accuracy : 0.8085
## 95% CI : (0.7997, 0.8171)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3286
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.63158
## Specificity : 0.83138
## Pos Pred Value : 0.32579
## Neg Pred Value : 0.94592
## Prevalence : 0.11427
## Detection Rate : 0.07217
## Detection Prevalence : 0.22153
## Balanced Accuracy : 0.73148
##
## 'Positive' Class : 1
##
rf_ds = ranger(y ~ .,
data = bank_train_ds,
num.trees = 1000,
importance = "impurity",
write.forest = T,
probability = T)
## Ranger result
##
## Call:
## ranger(y ~ ., data = bank_train_ds, num.trees = 1000, importance = "impurity", write.forest = T, probability = T)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 7304
## Number of independent variables: 15
## Mtry: 3
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1753084
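Because the forest was grown with importance = "impurity", per-variable Gini importance can be read straight off the fitted object — a one-line sketch, assuming rf_ds from the call above:

sort(ranger::importance(rf_ds), decreasing = TRUE)  # most influential variables first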
measure_train_ds = fun_gg_cutoff(rf_train_score_ds, bank_train_ds$y,
"acc", "sens")
measure_train_ds +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
rf_train_class_ds = fun_cut_predict(rf_train_score_ds, 0.375)
# matrix
rf_train_confm_ds = confusionMatrix(rf_train_class_ds, bank_train_ds$y,
positive = "1")
rf_train_confm_ds
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2940 527
## 1 712 3125
##
## Accuracy : 0.8304
## 95% CI : (0.8216, 0.8389)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6607
##
## Mcnemar's Test P-Value : 1.72e-07
##
## Sensitivity : 0.8557
## Specificity : 0.8050
## Pos Pred Value : 0.8144
## Neg Pred Value : 0.8480
## Prevalence : 0.5000
## Detection Rate : 0.4278
## Detection Prevalence : 0.5253
## Balanced Accuracy : 0.8304
##
## 'Positive' Class : 1
##
measure_test_ds = fun_gg_cutoff(rf_test_score_ds, bank_test_ds$y,
"acc", "sens")
measure_test_ds +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
rf_test_class_ds = fun_cut_predict(rf_test_score_ds, 0.5)
# matrix
rf_test_confm_ds = confusionMatrix(rf_test_class_ds, bank_test_ds$y,
positive = "1")
rf_test_confm_ds
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5977 337
## 1 1092 575
##
## Accuracy : 0.8209
## 95% CI : (0.8124, 0.8293)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3499
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.63048
## Specificity : 0.84552
## Pos Pred Value : 0.34493
## Neg Pred Value : 0.94663
## Prevalence : 0.11427
## Detection Rate : 0.07205
## Detection Prevalence : 0.20887
## Balanced Accuracy : 0.73800
##
## 'Positive' Class : 1
##
score_train_ds = data.frame("logistic complex" = logistic_train_score_ds,
"logistic simple" = logistic_train_score_ds_2,
"tree complex" = tree_opt_train_score_ds,
"tree simple" = tree_opt_train_score_ds_2,
"random forest" = rf_train_score_ds,
"obs" = as.numeric(bank_train_ds$y) - 1) %>%
gather(key = "Method", value = "score", -obs)
roc_train_ds = score_train_ds %>%
ggplot() +
aes(d = obs,
m = score,
color = Method) +
geom_roc() +
ggtitle("Train dataset")
print(roc_train_ds)
data.frame("Model" = c("Logistic regression (complex)",
"Logistic regression (simple)",
"Decision tree (complex)",
"Decision tree (simple)",
"Random forest (ranger)"),
"AUC" = c(auc(bank_train_ds$y, logistic_train_score_ds),
auc(bank_train_ds$y, logistic_train_score_ds_2),
auc(bank_train_ds$y, tree_opt_train_score_ds),
auc(bank_train_ds$y, tree_opt_train_score_ds_2),
auc(bank_train_ds$y, rf_train_score_ds)),
"Threshold" = c(0.5, 0.5, 0.5, 0.5, 0.5),
"Accuracy" = c(logistic_train_confm_ds[["overall"]][["Accuracy"]],
logistic_train_confm_ds_2[["overall"]][["Accuracy"]],
tree_opt_train_confm_ds[["overall"]][["Accuracy"]],
tree_opt_train_confm_ds_2[["overall"]][["Accuracy"]],
rf_train_confm_ds[["overall"]][["Accuracy"]]),
"Sensitivity" = c(logistic_train_confm_ds[["byClass"]][["Sensitivity"]],
logistic_train_confm_ds_2[["byClass"]][["Sensitivity"]],
tree_opt_train_confm_ds[["byClass"]][["Sensitivity"]],
tree_opt_train_confm_ds_2[["byClass"]][["Sensitivity"]],
rf_train_confm_ds[["byClass"]][["Sensitivity"]]),
stringsAsFactors = F)
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7980779 0.5 0.7423330 0.6410186
## 2 Logistic regression (simple) 0.7971848 0.5 0.7404162 0.6410186
## 3 Decision tree (complex) 0.7891507 0.5 0.7599945 0.6626506
## 4 Decision tree (simple) 0.7742743 0.5 0.7545181 0.6689485
## 5 Random forest (ranger) 0.9166259 0.5 0.8303669 0.8556955
score_test_ds = data.frame("logistic (complex)" = logistic_test_score_ds,
"logistic (simple)" = logistic_test_score_ds_2,
"tree complex" = tree_opt_test_score_ds,
"tree simple" = tree_opt_test_score_ds_2,
"random forest" = rf_test_score_ds,
"obs" = as.numeric(bank_test_ds$y) - 1) %>%
gather(key = "Method", value = "score", -obs)
roc_test_ds = score_test_ds %>%
ggplot() +
aes(d = obs,
m = score,
color = Method) +
geom_roc() +
ggtitle("Validation dataset")
roc_test_ds
df_final_ds = data.frame("Model" = c("Logistic regression (complex)",
"Logistic regression (simple)",
"Decision tree (complex)",
"Decision tree (simple)",
"Random forest (ranger)"),
"AUC" = c(auc(bank_test_ds$y, logistic_test_score_ds),
auc(bank_test_ds$y, logistic_test_score_ds_2),
auc(bank_test_ds$y, tree_opt_test_score_ds),
auc(bank_test_ds$y, tree_opt_test_score_ds_2),
auc(bank_test_ds$y, rf_test_score_ds)),
"Threshold" = c(0.5, 0.5, 0.5, 0.5, 0.5),
"Accuracy" = c(logistic_test_confm_ds[["overall"]][["Accuracy"]],
logistic_test_confm_ds_2[["overall"]][["Accuracy"]],
tree_opt_test_confm_ds[["overall"]][["Accuracy"]],
tree_opt_test_confm_ds_2[["overall"]][["Accuracy"]],
rf_test_confm_ds[["overall"]][["Accuracy"]]),
"Sensitivity" = c(logistic_test_confm_ds[["byClass"]][["Sensitivity"]],
logistic_test_confm_ds_2[["byClass"]][["Sensitivity"]],
tree_opt_test_confm_ds[["byClass"]][["Sensitivity"]],
tree_opt_test_confm_ds_2[["byClass"]][["Sensitivity"]],
rf_test_confm_ds[["byClass"]][["Sensitivity"]]),
stringsAsFactors = F)
df_final_ds
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7776273 0.5 0.8130560 0.5997807
## 2 Logistic regression (simple) 0.7786925 0.5 0.8104248 0.6063596
## 3 Decision tree (complex) 0.7616706 0.5 0.8178173 0.6162281
## 4 Decision tree (simple) 0.7515769 0.5 0.8085453 0.6315789
## 5 Random forest (ranger) 0.7881108 0.5 0.8209498 0.6304825
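Every row above uses the fixed 0.5 cutoff, which was tuned on the balanced training sample; on the imbalanced validation set, a data-driven cutoff (e.g. maximising Youden's J) will typically trade some accuracy for sensitivity. A sketch with pROC, using the random-forest scores (roc_obj is an illustrative name):

roc_obj = roc(bank_test_ds$y, rf_test_score_ds)
coords(roc_obj, x = "best", best.method = "youden",
ret = c("threshold", "sensitivity", "specificity"))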
set.seed(1234)
bank_train_us = upSample(x = bank_train %>% select(-y),
y = bank_train$y,
yname = "y")
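upSample resamples the minority class with replacement until both classes have the same count, so the up-sampled training set is balanced while the test set keeps the original class distribution. A quick sanity check:

table(bank_train$y)     # original, imbalanced counts
table(bank_train_us$y)  # after upSample: equal class counts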
bank_test_us = bank_test
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_us)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0217 -0.8684 -0.1270 0.8192 1.9972
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 56.6400226 13.4536911 4.210 2.55e-05 ***
## agemid -0.1692613 0.0282536 -5.991 2.09e-09 ***
## agehigh 0.0937154 0.0793862 1.180 0.237802
## jobhousemaid -0.1156831 0.0903583 -1.280 0.200450
## jobservices -0.1264988 0.0715798 -1.767 0.077188 .
## jobadmin. 0.0009011 0.0656151 0.014 0.989043
## jobblue-collar -0.1320699 0.0674178 -1.959 0.050116 .
## jobtechnician 0.0652448 0.0680259 0.959 0.337500
## jobretired 0.2472192 0.0837829 2.951 0.003170 **
## jobmanagement -0.0294588 0.0737249 -0.400 0.689467
## jobself-employed -0.0067750 0.0827360 -0.082 0.934737
## jobentrepreneur 0.1233806 0.0812800 1.518 0.129022
## jobstudent 0.2988550 0.0878194 3.403 0.000666 ***
## maritalmarried 0.0554527 0.0254305 2.181 0.029216 *
## maritaldivorced 0.1080766 0.0370867 2.914 0.003566 **
## educationhigh.school 0.1494877 0.0448192 3.335 0.000852 ***
## educationbasic.6y 0.2920990 0.0552003 5.292 1.21e-07 ***
## educationbasic.9y 0.0949928 0.0445902 2.130 0.033143 *
## educationprofessional.course 0.1354616 0.0499531 2.712 0.006692 **
## educationuniversity.degree 0.1795015 0.0441679 4.064 4.82e-05 ***
## contactcellular 0.4567357 0.0369481 12.362 < 2e-16 ***
## month(05)may -1.6351093 0.0840180 -19.461 < 2e-16 ***
## month(06)jun -0.6290126 0.1047298 -6.006 1.90e-09 ***
## month(07)jul -0.7086905 0.0920235 -7.701 1.35e-14 ***
## month(08)aug -1.0586539 0.0952225 -11.118 < 2e-16 ***
## month(10)oct -0.4070969 0.1086347 -3.747 0.000179 ***
## month(11)nov -1.2659289 0.0863720 -14.657 < 2e-16 ***
## month(12)dec -0.6711516 0.1559816 -4.303 1.69e-05 ***
## month(04)apr -0.8971128 0.0890845 -10.070 < 2e-16 ***
## month(09)sep -1.3985292 0.1112664 -12.569 < 2e-16 ***
## day_of_week(02)tue 0.1180179 0.0322000 3.665 0.000247 ***
## day_of_week(03)wed 0.2794737 0.0316446 8.832 < 2e-16 ***
## day_of_week(04)thu 0.1860305 0.0312275 5.957 2.56e-09 ***
## day_of_week(05)fri 0.1841214 0.0321461 5.728 1.02e-08 ***
## campaign2 -0.0132633 0.0247179 -0.537 0.591554
## campaign3 0.0498580 0.0312476 1.596 0.110583
## campaign4 0.0284793 0.0418671 0.680 0.496358
## campaign5 -0.1709731 0.0545138 -3.136 0.001711 **
## campaign6 -0.0738881 0.0686335 -1.077 0.281677
## campaign7 -0.3910252 0.0921780 -4.242 2.21e-05 ***
## campaign8 -0.6665128 0.1255954 -5.307 1.12e-07 ***
## campaign9 -0.2039070 0.1410048 -1.446 0.148149
## campaign10 -0.2936664 0.1536100 -1.912 0.055906 .
## previous1 0.3188200 0.1672746 1.906 0.056654 .
## previous2+ 0.1142981 0.1984781 0.576 0.564701
## poutcomefailure -0.8150449 0.1659705 -4.911 9.07e-07 ***
## poutcomesuccess NA NA NA NA
## cons.price.idx -0.0096777 0.0753486 -0.128 0.897801
## cons.conf.idx 0.0197910 0.0050271 3.937 8.25e-05 ***
## euribor3m -0.0212993 0.0716925 -0.297 0.766396
## nr.employed -0.0105780 0.0013871 -7.626 2.42e-14 ***
## pdays_dummy1 1.1722527 0.1648295 7.112 1.14e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 78406 on 56557 degrees of freedom
## Residual deviance: 60006 on 56507 degrees of freedom
## AIC: 60108
##
## Number of Fisher Scoring iterations: 5
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
measure_train_us = fun_gg_cutoff(logistic_train_score_us, bank_train_us$y,
"acc", "sens")
measure_train_us +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
logistic_train_class_us = fun_cut_predict(logistic_train_score_us, 0.5)
# matrix
logistic_train_confm_us = confusionMatrix(logistic_train_class_us, bank_train_us$y,
positive = "1")
logistic_train_confm_us
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24111 10234
## 1 4168 18045
##
## Accuracy : 0.7454
## 95% CI : (0.7417, 0.7489)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4907
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6381
## Specificity : 0.8526
## Pos Pred Value : 0.8124
## Neg Pred Value : 0.7020
## Prevalence : 0.5000
## Detection Rate : 0.3191
## Detection Prevalence : 0.3927
## Balanced Accuracy : 0.7454
##
## 'Positive' Class : 1
##
measure_test_us = fun_gg_cutoff(logistic_test_score_us, bank_test_us$y,
"acc", "sens")
measure_test_us +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
logistic_test_class_us = fun_cut_predict(logistic_test_score_us, 0.5)
# matrix
logistic_test_confm_us = confusionMatrix(logistic_test_class_us, bank_test_us$y,
positive = "1")
logistic_test_confm_us
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6016 371
## 1 1053 541
##
## Accuracy : 0.8216
## 95% CI : (0.813, 0.8299)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3351
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.59320
## Specificity : 0.85104
## Pos Pred Value : 0.33940
## Neg Pred Value : 0.94191
## Prevalence : 0.11427
## Detection Rate : 0.06779
## Detection Prevalence : 0.19972
## Balanced Accuracy : 0.72212
##
## 'Positive' Class : 1
##
##
## Call:
## glm(formula = y ~ . - cons.price.idx - euribor3m, family = "binomial",
## data = bank_train_us)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0216 -0.8687 -0.1287 0.8213 1.9931
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 58.1287988 1.0373972 56.033 < 2e-16 ***
## agemid -0.1703704 0.0282327 -6.035 1.59e-09 ***
## agehigh 0.0942782 0.0793556 1.188 0.234815
## jobhousemaid -0.1170609 0.0903389 -1.296 0.195046
## jobservices -0.1281882 0.0715445 -1.792 0.073177 .
## jobadmin. -0.0003758 0.0655878 -0.006 0.995428
## jobblue-collar -0.1328268 0.0673758 -1.971 0.048674 *
## jobtechnician 0.0628632 0.0679813 0.925 0.355115
## jobretired 0.2458144 0.0837528 2.935 0.003335 **
## jobmanagement -0.0302889 0.0737014 -0.411 0.681096
## jobself-employed -0.0080796 0.0827049 -0.098 0.922177
## jobentrepreneur 0.1238097 0.0812484 1.524 0.127548
## jobstudent 0.2972363 0.0877990 3.385 0.000711 ***
## maritalmarried 0.0547537 0.0254206 2.154 0.031247 *
## maritaldivorced 0.1070324 0.0370727 2.887 0.003888 **
## educationhigh.school 0.1501099 0.0448107 3.350 0.000808 ***
## educationbasic.6y 0.2926070 0.0551719 5.304 1.14e-07 ***
## educationbasic.9y 0.0955512 0.0445764 2.144 0.032070 *
## educationprofessional.course 0.1357168 0.0499521 2.717 0.006589 **
## educationuniversity.degree 0.1794330 0.0441525 4.064 4.83e-05 ***
## contactcellular 0.4747518 0.0334315 14.201 < 2e-16 ***
## month(05)may -1.6298604 0.0799383 -20.389 < 2e-16 ***
## month(06)jun -0.6261661 0.0845339 -7.407 1.29e-13 ***
## month(07)jul -0.7172775 0.0851193 -8.427 < 2e-16 ***
## month(08)aug -1.0544838 0.0913791 -11.540 < 2e-16 ***
## month(10)oct -0.4169102 0.1083661 -3.847 0.000119 ***
## month(11)nov -1.2682649 0.0864823 -14.665 < 2e-16 ***
## month(12)dec -0.6680157 0.1560079 -4.282 1.85e-05 ***
## month(04)apr -0.8896348 0.0825367 -10.779 < 2e-16 ***
## month(09)sep -1.4185115 0.1100721 -12.887 < 2e-16 ***
## day_of_week(02)tue 0.1178911 0.0321954 3.662 0.000251 ***
## day_of_week(03)wed 0.2793659 0.0316418 8.829 < 2e-16 ***
## day_of_week(04)thu 0.1862294 0.0311848 5.972 2.35e-09 ***
## day_of_week(05)fri 0.1848426 0.0321153 5.756 8.63e-09 ***
## campaign2 -0.0137088 0.0247116 -0.555 0.579063
## campaign3 0.0487829 0.0312224 1.562 0.118187
## campaign4 0.0283624 0.0418558 0.678 0.498012
## campaign5 -0.1708053 0.0545032 -3.134 0.001725 **
## campaign6 -0.0725155 0.0685905 -1.057 0.290410
## campaign7 -0.3903269 0.0921664 -4.235 2.29e-05 ***
## campaign8 -0.6644914 0.1255451 -5.293 1.20e-07 ***
## campaign9 -0.2045572 0.1409559 -1.451 0.146720
## campaign10 -0.2952504 0.1536090 -1.922 0.054594 .
## previous1 0.3160783 0.1674063 1.888 0.059014 .
## previous2+ 0.0946721 0.1979125 0.478 0.632399
## poutcomefailure -0.8090101 0.1660720 -4.871 1.11e-06 ***
## poutcomesuccess NA NA NA NA
## cons.conf.idx 0.0191473 0.0031135 6.150 7.76e-10 ***
## nr.employed -0.0110631 0.0002049 -53.986 < 2e-16 ***
## pdays_dummy1 1.1752476 0.1650095 7.122 1.06e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 78406 on 56557 degrees of freedom
## Residual deviance: 60008 on 56509 degrees of freedom
## AIC: 60106
##
## Number of Fisher Scoring iterations: 5
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
measure_train_us_2 = fun_gg_cutoff(logistic_train_score_us_2, bank_train_us$y,
"acc", "sens")
measure_train_us_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
logistic_train_class_us_2 = fun_cut_predict(logistic_train_score_us_2, 0.5)
# matrix
logistic_train_confm_us_2 = confusionMatrix(logistic_train_class_us_2, bank_train_us$y,
positive = "1")
logistic_train_confm_us_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24143 10279
## 1 4136 18000
##
## Accuracy : 0.7451
## 95% CI : (0.7415, 0.7487)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4903
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6365
## Specificity : 0.8537
## Pos Pred Value : 0.8132
## Neg Pred Value : 0.7014
## Prevalence : 0.5000
## Detection Rate : 0.3183
## Detection Prevalence : 0.3914
## Balanced Accuracy : 0.7451
##
## 'Positive' Class : 1
##
measure_test_us_2 = fun_gg_cutoff(logistic_test_score_us_2, bank_test_us$y,
"acc", "sens")
measure_test_us_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
logistic_test_class_us_2 = fun_cut_predict(logistic_test_score_us_2, 0.5)
# matrix
logistic_test_confm_us_2 = confusionMatrix(logistic_test_class_us_2, bank_test_us$y,
positive = "1")
logistic_test_confm_us_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6028 374
## 1 1041 538
##
## Accuracy : 0.8227
## 95% CI : (0.8141, 0.831)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3357
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.58991
## Specificity : 0.85274
## Pos Pred Value : 0.34072
## Neg Pred Value : 0.94158
## Prevalence : 0.11427
## Detection Rate : 0.06741
## Detection Prevalence : 0.19784
## Balanced Accuracy : 0.72132
##
## 'Positive' Class : 1
##
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
cp_best_us = tree_us$cptable %>%
as.data.frame %>%
filter(xerror == min(xerror)) %>%
select(CP) %>%
slice(1) %>%
as.numeric()
tree_opt_us = prune(tree_us,
cp = cp_best_us)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
measure_train_us = fun_gg_cutoff(tree_opt_train_score_us, bank_train_us$y,
"acc", "sens")
measure_train_us +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_train_class_us = fun_cut_predict(tree_opt_train_score_us, 0.5)
tree_opt_train_confm_us = confusionMatrix(tree_opt_train_class_us, bank_train_us$y,
positive = "1")
tree_opt_train_confm_us
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24286 9076
## 1 3993 19203
##
## Accuracy : 0.7689
## 95% CI : (0.7654, 0.7724)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5379
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6791
## Specificity : 0.8588
## Pos Pred Value : 0.8279
## Neg Pred Value : 0.7280
## Prevalence : 0.5000
## Detection Rate : 0.3395
## Detection Prevalence : 0.4101
## Balanced Accuracy : 0.7689
##
## 'Positive' Class : 1
##
measure_test_us = fun_gg_cutoff(tree_opt_test_score_us, bank_test_us$y,
"acc", "sens")
measure_test_us +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_test_class_us = fun_cut_predict(tree_opt_test_score_us, 0.5)
# matrix
tree_opt_test_confm_us = confusionMatrix(tree_opt_test_class_us, bank_test_us$y,
positive = "1")
tree_opt_test_confm_us
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6045 364
## 1 1024 548
##
## Accuracy : 0.8261
## 95% CI : (0.8176, 0.8343)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3467
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60088
## Specificity : 0.85514
## Pos Pred Value : 0.34860
## Neg Pred Value : 0.94320
## Prevalence : 0.11427
## Detection Rate : 0.06866
## Detection Prevalence : 0.19697
## Balanced Accuracy : 0.72801
##
## 'Positive' Class : 1
##
tree_us_2 = rpart(y ~ nr.employed + euribor3m + cons.conf.idx + pdays_dummy + poutcome + cons.price.idx + month,
data = bank_train_us,
cp = 0.0005)
cp_best_us_2 = tree_us_2$cptable %>%
as.data.frame %>%
filter(xerror == min(xerror)) %>%
select(CP) %>%
slice(1) %>%
as.numeric()
tree_opt_us_2 = prune(tree_us_2,
cp = cp_best_us_2)
measure_train_us_2 = fun_gg_cutoff(tree_opt_train_score_us_2, bank_train_us$y,
"acc", "sens")
measure_train_us_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_train_class_us_2 = fun_cut_predict(tree_opt_train_score_us_2, 0.5)
tree_opt_train_confm_us_2 = confusionMatrix(tree_opt_train_class_us_2, bank_train_us$y,
positive = "1")
tree_opt_train_confm_us_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24394 9951
## 1 3885 18328
##
## Accuracy : 0.7554
## 95% CI : (0.7518, 0.7589)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5107
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6481
## Specificity : 0.8626
## Pos Pred Value : 0.8251
## Neg Pred Value : 0.7103
## Prevalence : 0.5000
## Detection Rate : 0.3241
## Detection Prevalence : 0.3927
## Balanced Accuracy : 0.7554
##
## 'Positive' Class : 1
##
measure_test_us_2 = fun_gg_cutoff(tree_opt_test_score_us_2, bank_test_us$y,
"acc", "sens")
measure_test_us_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_test_class_us_2 = fun_cut_predict(tree_opt_test_score_us_2, 0.5)
# matrix
tree_opt_test_confm_us_2 = confusionMatrix(tree_opt_test_class_us_2, bank_test_us$y,
positive = "1")
tree_opt_test_confm_us_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6109 368
## 1 960 544
##
## Accuracy : 0.8336
## 95% CI : (0.8252, 0.8417)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3592
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.59649
## Specificity : 0.86420
## Pos Pred Value : 0.36170
## Neg Pred Value : 0.94318
## Prevalence : 0.11427
## Detection Rate : 0.06816
## Detection Prevalence : 0.18845
## Balanced Accuracy : 0.73034
##
## 'Positive' Class : 1
##
rf_us = ranger(y ~ .,
data = bank_train_us,
num.trees = 1000,
importance = "impurity",
write.forest = T,
probability = T)
## Ranger result
##
## Call:
## ranger(y ~ ., data = bank_train_us, num.trees = 1000, importance = "impurity", write.forest = T, probability = T)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 56558
## Number of independent variables: 15
## Mtry: 3
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1206238
measure_train_us = fun_gg_cutoff(rf_train_score_us, bank_train_us$y,
"acc", "sens")
measure_train_us +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
rf_train_class_us = fun_cut_predict(rf_train_score_us, 0.375)
# matrix
rf_train_confm_us = confusionMatrix(rf_train_class_us, bank_train_us$y,
positive = "1")
rf_train_confm_us
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23823 3063
## 1 4456 25216
##
## Accuracy : 0.8671
## 95% CI : (0.8642, 0.8698)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7341
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8917
## Specificity : 0.8424
## Pos Pred Value : 0.8498
## Neg Pred Value : 0.8861
## Prevalence : 0.5000
## Detection Rate : 0.4458
## Detection Prevalence : 0.5246
## Balanced Accuracy : 0.8671
##
## 'Positive' Class : 1
##
measure_test_us = fun_gg_cutoff(rf_test_score_us, bank_test_us$y,
"acc", "sens")
measure_test_us +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
rf_test_class_us = fun_cut_predict(rf_test_score_us, 0.5)
# matrix
rf_test_confm_us = confusionMatrix(rf_test_class_us, bank_test_us$y,
positive = "1")
rf_test_confm_us
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6437 424
## 1 632 488
##
## Accuracy : 0.8677
## 95% CI : (0.8601, 0.875)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4054
##
## Mcnemar's Test P-Value : 1.89e-10
##
## Sensitivity : 0.53509
## Specificity : 0.91060
## Pos Pred Value : 0.43571
## Neg Pred Value : 0.93820
## Prevalence : 0.11427
## Detection Rate : 0.06115
## Detection Prevalence : 0.14033
## Balanced Accuracy : 0.72284
##
## 'Positive' Class : 1
##
score_train_us = data.frame("logistic complex" = logistic_train_score_us,
"logistic simple" = logistic_train_score_us_2,
"tree complex" = tree_opt_train_score_us,
"tree simple" = tree_opt_train_score_us_2,
"random forest" = rf_train_score_us,
"obs" = as.numeric(bank_train_us$y) - 1) %>%
gather(key = "Method", value = "score", -obs)
roc_train_us = score_train_us %>%
ggplot() +
aes(d = obs,
m = score,
color = Method) +
geom_roc() +
ggtitle("Train dataset")
print(roc_train_us)
data.frame("Model" = c("Logistic regression (complex)",
"Logistic regression (simple)",
"Decision tree (complex)",
"Decision tree (simple)",
"Random forest (ranger)"),
"AUC" = c(auc(bank_train_us$y, logistic_train_score_us),
auc(bank_train_us$y, logistic_train_score_us_2),
auc(bank_train_us$y, tree_opt_train_score_us),
auc(bank_train_us$y, tree_opt_train_score_us_2),
auc(bank_train_us$y, rf_train_score_us)),
"Threshold" = c(0.5, 0.5, 0.5, 0.5, 0.5),
"Accuracy" = c(logistic_train_confm_us[["overall"]][["Accuracy"]],
logistic_train_confm_us_2[["overall"]][["Accuracy"]],
tree_opt_train_confm_us[["overall"]][["Accuracy"]],
tree_opt_train_confm_us_2[["overall"]][["Accuracy"]],
rf_train_confm_us[["overall"]][["Accuracy"]]),
"Sensitivity" = c(logistic_train_confm_us[["byClass"]][["Sensitivity"]],
logistic_train_confm_us_2[["byClass"]][["Sensitivity"]],
tree_opt_train_confm_us[["byClass"]][["Sensitivity"]],
tree_opt_train_confm_us_2[["byClass"]][["Sensitivity"]],
rf_train_confm_us[["byClass"]][["Sensitivity"]]),
stringsAsFactors = F)
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.8027625 0.5 0.7453587 0.6381060
## 2 Logistic regression (simple) 0.8027843 0.5 0.7451289 0.6365147
## 3 Decision tree (complex) 0.8069836 0.5 0.7689275 0.6790551
## 4 Decision tree (simple) 0.7889196 0.5 0.7553662 0.6481134
## 5 Random forest (ranger) 0.9445846 0.5 0.8670568 0.8916864
score_test_us = data.frame("logistic (complex)" = logistic_test_score_us,
"logistic (simple)" = logistic_test_score_us_2,
"tree complex" = tree_opt_test_score_us,
"tree simple" = tree_opt_test_score_us_2,
"random forest" = rf_test_score_us,
"obs" = as.numeric(bank_test_us$y) - 1) %>%
gather(key = "Method", value = "score", -obs)
roc_test_us = score_test_us %>%
ggplot() +
aes(d = obs,
m = score,
color = Method) +
geom_roc() +
ggtitle("Validation dataset")
roc_test_us
df_final_us = data.frame("Model" = c("Logistic regression (complex)",
"Logistic regression (simple)",
"Decision tree (complex)",
"Decision tree (simple)",
"Random forest (ranger)"),
"AUC" = c(auc(bank_test_us$y, logistic_test_score_us),
auc(bank_test_us$y, logistic_test_score_us_2),
auc(bank_test_us$y, tree_opt_test_score_us),
auc(bank_test_us$y, tree_opt_test_score_us_2),
auc(bank_test_us$y, rf_test_score_us)),
"Threshold" = c(0.5, 0.5, 0.5, 0.5, 0.5),
"Accuracy" = c(logistic_test_confm_us[["overall"]][["Accuracy"]],
logistic_test_confm_us_2[["overall"]][["Accuracy"]],
tree_opt_test_confm_us[["overall"]][["Accuracy"]],
tree_opt_test_confm_us_2[["overall"]][["Accuracy"]],
rf_test_confm_us[["overall"]][["Accuracy"]]),
"Sensitivity" = c(logistic_test_confm_us[["byClass"]][["Sensitivity"]],
logistic_test_confm_us_2[["byClass"]][["Sensitivity"]],
tree_opt_test_confm_us[["byClass"]][["Sensitivity"]],
tree_opt_test_confm_us_2[["byClass"]][["Sensitivity"]],
rf_test_confm_us[["byClass"]][["Sensitivity"]]),
stringsAsFactors = F)
df_final_us
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7767975 0.5 0.8215762 0.5932018
## 2 Logistic regression (simple) 0.7767675 0.5 0.8227039 0.5899123
## 3 Decision tree (complex) 0.7566659 0.5 0.8260870 0.6008772
## 4 Decision tree (simple) 0.7519985 0.5 0.8336048 0.5964912
## 5 Random forest (ranger) 0.7809281 0.5 0.8676858 0.5350877
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train_smote)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9080 -0.5557 -0.4229 0.3906 2.8333
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 93.2488632 19.7373337 4.724 2.31e-06 ***
## agemid -0.5344829 0.0452568 -11.810 < 2e-16 ***
## agehigh 0.3987200 0.1090570 3.656 0.000256 ***
## jobhousemaid -0.3802525 0.1612967 -2.357 0.018400 *
## jobservices -0.4159130 0.1229044 -3.384 0.000714 ***
## jobadmin. -0.3746699 0.1118170 -3.351 0.000806 ***
## jobblue-collar -0.4558465 0.1157280 -3.939 8.18e-05 ***
## jobtechnician -0.2632381 0.1154590 -2.280 0.022612 *
## jobretired -0.0589143 0.1344556 -0.438 0.661263
## jobmanagement -0.3870398 0.1285658 -3.010 0.002609 **
## jobself-employed -0.4034715 0.1504745 -2.681 0.007333 **
## jobentrepreneur -0.2579890 0.1457006 -1.771 0.076614 .
## jobstudent 0.1946682 0.1393482 1.397 0.162416
## maritalmarried -0.1742006 0.0427968 -4.070 4.69e-05 ***
## maritaldivorced 0.0410038 0.0646467 0.634 0.525901
## educationhigh.school 0.0705941 0.0764061 0.924 0.355521
## educationbasic.6y -0.0047153 0.1060676 -0.044 0.964541
## educationbasic.9y -0.0478872 0.0799833 -0.599 0.549363
## educationprofessional.course 0.0002181 0.0853987 0.003 0.997962
## educationuniversity.degree 0.0253369 0.0751128 0.337 0.735877
## contactcellular -0.2847054 0.0502567 -5.665 1.47e-08 ***
## month(05)may -1.6654771 0.1285150 -12.959 < 2e-16 ***
## month(06)jun -0.6333468 0.1483437 -4.269 1.96e-05 ***
## month(07)jul -0.5554402 0.1381080 -4.022 5.78e-05 ***
## month(08)aug -0.7539648 0.1426203 -5.287 1.25e-07 ***
## month(10)oct -0.6250673 0.1673719 -3.735 0.000188 ***
## month(11)nov -1.0770170 0.1382262 -7.792 6.61e-15 ***
## month(12)dec -0.6907535 0.2400215 -2.878 0.004004 **
## month(04)apr -0.8380774 0.1359455 -6.165 7.06e-10 ***
## month(09)sep -1.2121525 0.1672006 -7.250 4.18e-13 ***
## day_of_week(02)tue 0.2227235 0.0584656 3.809 0.000139 ***
## day_of_week(03)wed 0.1938470 0.0584308 3.318 0.000908 ***
## day_of_week(04)thu 0.2133142 0.0568266 3.754 0.000174 ***
## day_of_week(05)fri 0.1727889 0.0578978 2.984 0.002842 **
## campaign2 0.1206141 0.0445051 2.710 0.006726 **
## campaign3 0.2059940 0.0569927 3.614 0.000301 ***
## campaign4 0.0950811 0.0765380 1.242 0.214136
## campaign5 -0.1377385 0.1024735 -1.344 0.178904
## campaign6 -0.1368122 0.1243314 -1.100 0.271165
## campaign7 -0.1579635 0.1740050 -0.908 0.363979
## campaign8 -0.6284708 0.2375737 -2.645 0.008160 **
## campaign9 -0.2192995 0.3040784 -0.721 0.470790
## campaign10 -0.2868000 0.2896938 -0.990 0.322169
## previous1 0.9482097 0.0735581 12.891 < 2e-16 ***
## previous2+ 1.2651863 0.1005609 12.581 < 2e-16 ***
## poutcomefailure -1.1055365 0.0794350 -13.918 < 2e-16 ***
## poutcomesuccess 1.7184241 0.0872614 19.693 < 2e-16 ***
## cons.price.idx -0.2107228 0.1101344 -1.913 0.055707 .
## cons.conf.idx -0.0095942 0.0075382 -1.273 0.203111
## euribor3m 0.0892138 0.1068263 0.835 0.403645
## nr.employed -0.0142885 0.0020582 -6.942 3.86e-12 ***
## pdays_dummy1 1.8923284 0.0795544 23.787 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34916 on 25563 degrees of freedom
## Residual deviance: 19434 on 25512 degrees of freedom
## AIC: 19538
##
## Number of Fisher Scoring iterations: 6
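As with the other models, the scores fed into the cutoff plot below are presumably the fitted response probabilities of the SMOTE logistic model; a minimal sketch (the model object name `logistic_smote` is an assumption, as the fitting call is not shown in this excerpt):

```r
# Sketch: predicted subscription probabilities on the SMOTE train/test sets.
# The object name logistic_smote is assumed, by analogy with logistic_smote_2.
logistic_train_score_smote = predict(logistic_smote,
                                     newdata = bank_train_smote,
                                     type = "response")
logistic_test_score_smote = predict(logistic_smote,
                                    newdata = bank_test_smote,
                                    type = "response")
```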
measure_train_smote = fun_gg_cutoff(logistic_train_score_smote, bank_train_smote$y,
"acc", "sens")
measure_train_smote +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
logistic_train_class_smote = fun_cut_predict(logistic_train_score_smote, 0.5)
# matrix
logistic_train_confm_smote = confusionMatrix(logistic_train_class_smote, bank_train_smote$y,
positive = "1")
logistic_train_confm_smote
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13438 2482
## 1 1170 8474
##
## Accuracy : 0.8571
## 95% CI : (0.8528, 0.8614)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7039
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7735
## Specificity : 0.9199
## Pos Pred Value : 0.8787
## Neg Pred Value : 0.8441
## Prevalence : 0.4286
## Detection Rate : 0.3315
## Detection Prevalence : 0.3772
## Balanced Accuracy : 0.8467
##
## 'Positive' Class : 1
##
measure_test_smote = fun_gg_cutoff(logistic_test_score_smote, bank_test_smote$y,
"acc", "sens")
measure_test_smote +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
logistic_test_class_smote = fun_cut_predict(logistic_test_score_smote, 0.5)
# matrix
logistic_test_confm_smote = confusionMatrix(logistic_test_class_smote, bank_test_smote$y,
positive = "1")
logistic_test_confm_smote
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6473 465
## 1 596 447
##
## Accuracy : 0.8671
## 95% CI : (0.8594, 0.8744)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3819
##
## Mcnemar's Test P-Value : 6.579e-05
##
## Sensitivity : 0.49013
## Specificity : 0.91569
## Pos Pred Value : 0.42857
## Neg Pred Value : 0.93298
## Prevalence : 0.11427
## Detection Rate : 0.05601
## Detection Prevalence : 0.13069
## Balanced Accuracy : 0.70291
##
## 'Positive' Class : 1
##
logistic_smote_2 = glm(y ~ . - education - euribor3m - cons.conf.idx,
data = bank_train_smote,
family = "binomial")
##
## Call:
## glm(formula = y ~ . - education - euribor3m - cons.conf.idx,
## family = "binomial", data = bank_train_smote)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8855 -0.5571 -0.4240 0.3897 2.8093
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 76.7013362 3.7012677 20.723 < 2e-16 ***
## agemid -0.5359537 0.0451457 -11.872 < 2e-16 ***
## agehigh 0.3883324 0.1082084 3.589 0.000332 ***
## jobhousemaid -0.3839028 0.1600979 -2.398 0.016488 *
## jobservices -0.3881456 0.1216401 -3.191 0.001418 **
## jobadmin. -0.3583942 0.1110376 -3.228 0.001248 **
## jobblue-collar -0.4784466 0.1136135 -4.211 2.54e-05 ***
## jobtechnician -0.2637394 0.1139216 -2.315 0.020608 *
## jobretired -0.0614351 0.1341764 -0.458 0.647047
## jobmanagement -0.3769169 0.1271588 -2.964 0.003035 **
## jobself-employed -0.4050365 0.1498441 -2.703 0.006871 **
## jobentrepreneur -0.2508600 0.1452633 -1.727 0.084180 .
## jobstudent 0.2024349 0.1393701 1.452 0.146363
## maritalmarried -0.1802077 0.0425488 -4.235 2.28e-05 ***
## maritaldivorced 0.0388334 0.0645479 0.602 0.547426
## contactcellular -0.2659993 0.0475085 -5.599 2.16e-08 ***
## month(05)may -1.6948951 0.1262887 -13.421 < 2e-16 ***
## month(06)jun -0.6983883 0.1344518 -5.194 2.05e-07 ***
## month(07)jul -0.5979308 0.1341108 -4.458 8.25e-06 ***
## month(08)aug -0.8212427 0.1334215 -6.155 7.50e-10 ***
## month(10)oct -0.6525206 0.1626887 -4.011 6.05e-05 ***
## month(11)nov -1.0887553 0.1367547 -7.961 1.70e-15 ***
## month(12)dec -0.7289005 0.2372191 -3.073 0.002121 **
## month(04)apr -0.8622479 0.1314250 -6.561 5.35e-11 ***
## month(09)sep -1.2435974 0.1625352 -7.651 1.99e-14 ***
## day_of_week(02)tue 0.2216926 0.0583972 3.796 0.000147 ***
## day_of_week(03)wed 0.1921815 0.0583995 3.291 0.000999 ***
## day_of_week(04)thu 0.2099647 0.0567778 3.698 0.000217 ***
## day_of_week(05)fri 0.1721642 0.0578598 2.976 0.002925 **
## campaign2 0.1188650 0.0444799 2.672 0.007533 **
## campaign3 0.2065160 0.0569361 3.627 0.000287 ***
## campaign4 0.0917713 0.0765087 1.199 0.230338
## campaign5 -0.1348198 0.1023636 -1.317 0.187816
## campaign6 -0.1380927 0.1243314 -1.111 0.266705
## campaign7 -0.1608393 0.1737981 -0.925 0.354738
## campaign8 -0.6227392 0.2373027 -2.624 0.008684 **
## campaign9 -0.2195498 0.3041841 -0.722 0.470438
## campaign10 -0.2740653 0.2883995 -0.950 0.341961
## previous1 0.9494339 0.0735275 12.913 < 2e-16 ***
## previous2+ 1.2650879 0.1005895 12.577 < 2e-16 ***
## poutcomefailure -1.1068570 0.0794128 -13.938 < 2e-16 ***
## poutcomesuccess 1.7186775 0.0872048 19.709 < 2e-16 ***
## cons.price.idx -0.1210621 0.0415241 -2.915 0.003552 **
## nr.employed -0.0125620 0.0003288 -38.204 < 2e-16 ***
## pdays_dummy1 1.8921971 0.0795255 23.794 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34916 on 25563 degrees of freedom
## Residual deviance: 19439 on 25519 degrees of freedom
## AIC: 19529
##
## Number of Fisher Scoring iterations: 6
measure_train_smote_2 = fun_gg_cutoff(logistic_train_score_smote_2, bank_train_smote$y,
"acc", "sens")
measure_train_smote_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
logistic_train_class_smote_2 = fun_cut_predict(logistic_train_score_smote_2, 0.5)
# matrix
logistic_train_confm_smote_2 = confusionMatrix(logistic_train_class_smote_2, bank_train_smote$y,
positive = "1")
logistic_train_confm_smote_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13442 2482
## 1 1166 8474
##
## Accuracy : 0.8573
## 95% CI : (0.853, 0.8616)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7042
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7735
## Specificity : 0.9202
## Pos Pred Value : 0.8790
## Neg Pred Value : 0.8441
## Prevalence : 0.4286
## Detection Rate : 0.3315
## Detection Prevalence : 0.3771
## Balanced Accuracy : 0.8468
##
## 'Positive' Class : 1
##
measure_test_smote_2 = fun_gg_cutoff(logistic_test_score_smote_2, bank_test_smote$y,
"acc", "sens")
measure_test_smote_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
logistic_test_class_smote_2 = fun_cut_predict(logistic_test_score_smote_2, 0.5)
# matrix
logistic_test_confm_smote_2 = confusionMatrix(logistic_test_class_smote_2, bank_test_smote$y,
positive = "1")
logistic_test_confm_smote_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6475 462
## 1 594 450
##
## Accuracy : 0.8677
## 95% CI : (0.8601, 0.875)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3851
##
## Mcnemar's Test P-Value : 5.548e-05
##
## Sensitivity : 0.49342
## Specificity : 0.91597
## Pos Pred Value : 0.43103
## Neg Pred Value : 0.93340
## Prevalence : 0.11427
## Detection Rate : 0.05638
## Detection Prevalence : 0.13081
## Balanced Accuracy : 0.70470
##
## 'Positive' Class : 1
##
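The `tree_smote` object pruned below is not fitted in this excerpt; by analogy with `tree_smote_2` further down, it was presumably grown as a deliberately deep tree on all predictors (a sketch under that assumption, not the author's exact call):

```r
# Sketch: grow a deep tree (small cp, as used for tree_smote_2 below),
# to be pruned back at the cross-validated optimum from its cptable.
tree_smote = rpart(y ~ .,
                   data = bank_train_smote,
                   cp = 0.0005)
```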
cp_best_smote = tree_smote$cptable %>%
as.data.frame %>%
filter(xerror == min(xerror)) %>%
select(CP) %>%
slice(1) %>%
as.numeric()
tree_opt_smote = prune(tree_smote,
cp = cp_best_smote)
measure_train_smote = fun_gg_cutoff(tree_opt_train_score_smote, bank_train_smote$y,
"acc", "sens")
measure_train_smote +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_train_class_smote = fun_cut_predict(tree_opt_train_score_smote, 0.5)
tree_opt_train_confm_smote = confusionMatrix(tree_opt_train_class_smote, bank_train_smote$y,
positive = "1")
tree_opt_train_confm_smote
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13697 1827
## 1 911 9129
##
## Accuracy : 0.8929
## 95% CI : (0.889, 0.8967)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.779
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8332
## Specificity : 0.9376
## Pos Pred Value : 0.9093
## Neg Pred Value : 0.8823
## Prevalence : 0.4286
## Detection Rate : 0.3571
## Detection Prevalence : 0.3927
## Balanced Accuracy : 0.8854
##
## 'Positive' Class : 1
##
measure_test_smote = fun_gg_cutoff(tree_opt_test_score_smote, bank_test_smote$y,
"acc", "sens")
measure_test_smote +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_test_class_smote = fun_cut_predict(tree_opt_test_score_smote, 0.5)
# matrix
tree_opt_test_confm_smote = confusionMatrix(tree_opt_test_class_smote, bank_test_smote$y,
positive = "1")
tree_opt_test_confm_smote
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6575 491
## 1 494 421
##
## Accuracy : 0.8766
## 95% CI : (0.8692, 0.8837)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 0.9948
##
## Kappa : 0.3912
##
## Mcnemar's Test P-Value : 0.9492
##
## Sensitivity : 0.46162
## Specificity : 0.93012
## Pos Pred Value : 0.46011
## Neg Pred Value : 0.93051
## Prevalence : 0.11427
## Detection Rate : 0.05275
## Detection Prevalence : 0.11465
## Balanced Accuracy : 0.69587
##
## 'Positive' Class : 1
##
tree_smote_2 = rpart(y ~ nr.employed + euribor3m + cons.conf.idx + pdays_dummy +
poutcome + cons.price.idx + month + previous,
data = bank_train_smote,
cp = 0.0005)
cp_best_smote_2 = tree_smote_2$cptable %>%
as.data.frame %>%
filter(xerror == min(xerror)) %>%
select(CP) %>%
slice(1) %>%
as.numeric()
tree_opt_smote_2 = prune(tree_smote_2,
cp = cp_best_smote_2)
measure_train_smote_2 = fun_gg_cutoff(tree_opt_train_score_smote_2, bank_train_smote$y,
"acc", "sens")
measure_train_smote_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_train_class_smote_2 = fun_cut_predict(tree_opt_train_score_smote_2, 0.5)
tree_opt_train_confm_smote_2 = confusionMatrix(tree_opt_train_class_smote_2, bank_train_smote$y,
positive = "1")
tree_opt_train_confm_smote_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13471 1770
## 1 1137 9186
##
## Accuracy : 0.8863
## 95% CI : (0.8823, 0.8902)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7661
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8384
## Specificity : 0.9222
## Pos Pred Value : 0.8899
## Neg Pred Value : 0.8839
## Prevalence : 0.4286
## Detection Rate : 0.3593
## Detection Prevalence : 0.4038
## Balanced Accuracy : 0.8803
##
## 'Positive' Class : 1
##
measure_test_smote_2 = fun_gg_cutoff(tree_opt_test_score_smote_2, bank_test_smote$y,
"acc", "sens")
measure_test_smote_2 +
geom_vline(xintercept = c(0.5),
linetype = "dashed")
tree_opt_test_class_smote_2 = fun_cut_predict(tree_opt_test_score_smote_2, 0.5)
# matrix
tree_opt_test_confm_smote_2 = confusionMatrix(tree_opt_test_class_smote_2, bank_test_smote$y,
positive = "1")
tree_opt_test_confm_smote_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6513 458
## 1 556 454
##
## Accuracy : 0.8729
## 95% CI : (0.8654, 0.8802)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 0.999811
##
## Kappa : 0.4004
##
## Mcnemar's Test P-Value : 0.002318
##
## Sensitivity : 0.49781
## Specificity : 0.92135
## Pos Pred Value : 0.44950
## Neg Pred Value : 0.93430
## Prevalence : 0.11427
## Detection Rate : 0.05689
## Detection Prevalence : 0.12655
## Balanced Accuracy : 0.70958
##
## 'Positive' Class : 1
##
rf_smote = ranger(y ~ .,
data = bank_train_smote,
num.trees = 1000,
importance = "impurity",
write.forest = T,
probability = T)
## Ranger result
##
## Call:
## ranger(y ~ ., data = bank_train_smote, num.trees = 1000, importance = "impurity", write.forest = T, probability = T)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 25564
## Number of independent variables: 15
## Mtry: 3
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.07880187
measure_train_smote = fun_gg_cutoff(rf_train_score_smote, bank_train_smote$y,
"acc", "sens")
measure_train_smote +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
rf_train_class_smote = fun_cut_predict(rf_train_score_smote, 0.375)
# matrix
rf_train_confm_smote = confusionMatrix(rf_train_class_smote, bank_train_smote$y,
positive = "1")
rf_train_confm_smote
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13810 1383
## 1 798 9573
##
## Accuracy : 0.9147
## 95% CI : (0.9112, 0.9181)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8246
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8738
## Specificity : 0.9454
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.9090
## Prevalence : 0.4286
## Detection Rate : 0.3745
## Detection Prevalence : 0.4057
## Balanced Accuracy : 0.9096
##
## 'Positive' Class : 1
##
measure_test_smote = fun_gg_cutoff(rf_test_score_smote, bank_test_smote$y,
"acc", "sens")
measure_test_smote +
geom_vline(xintercept = c(0.375, 0.5),
linetype = "dashed")
rf_test_class_smote = fun_cut_predict(rf_test_score_smote, 0.5)
# matrix
rf_test_confm_smote = confusionMatrix(rf_test_class_smote, bank_test_smote$y,
positive = "1")
rf_test_confm_smote
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6579 483
## 1 490 429
##
## Accuracy : 0.8781
## 95% CI : (0.8707, 0.8852)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 0.9841
##
## Kappa : 0.3997
##
## Mcnemar's Test P-Value : 0.8475
##
## Sensitivity : 0.47039
## Specificity : 0.93068
## Pos Pred Value : 0.46681
## Neg Pred Value : 0.93161
## Prevalence : 0.11427
## Detection Rate : 0.05375
## Detection Prevalence : 0.11515
## Balanced Accuracy : 0.70054
##
## 'Positive' Class : 1
##
score_train_smote = data.frame("logistic complex" = logistic_train_score_smote,
"logistic simple" = logistic_train_score_smote_2,
"tree complex" = tree_opt_train_score_smote,
"tree simple" = tree_opt_train_score_smote_2,
"random forest" = rf_train_score_smote,
"obs" = as.numeric(bank_train_smote$y) - 1) %>%
gather(key = "Method", value = "score", -obs)
roc_train_smote = score_train_smote %>%
ggplot() +
aes(d = obs,
m = score,
color = Method) +
geom_roc() +
ggtitle("Train dataset")
print(roc_train_smote)
data.frame("Model" = c("Logistic regression (complex)",
"Logistic regression (simple)",
"Decision tree (complex)",
"Decision tree (simple)",
"Random forest (ranger)"),
"AUC" = c(auc(bank_train_smote$y, logistic_train_score_smote),
auc(bank_train_smote$y, logistic_train_score_smote_2),
auc(bank_train_smote$y, tree_opt_train_score_smote),
auc(bank_train_smote$y, tree_opt_train_score_smote_2),
auc(bank_train_smote$y, rf_train_score_smote)),
"Threshold" = c(0.5, 0.5, 0.5, 0.5, 0.5),
"Accuracy" = c(logistic_train_confm_smote[["overall"]][["Accuracy"]],
logistic_train_confm_smote_2[["overall"]][["Accuracy"]],
tree_opt_train_confm_smote[["overall"]][["Accuracy"]],
tree_opt_train_confm_smote_2[["overall"]][["Accuracy"]],
rf_train_confm_smote[["overall"]][["Accuracy"]]),
"Sensitivity" = c(logistic_train_confm_smote[["byClass"]][["Sensitivity"]],
logistic_train_confm_smote_2[["byClass"]][["Sensitivity"]],
tree_opt_train_confm_smote[["byClass"]][["Sensitivity"]],
tree_opt_train_confm_smote_2[["byClass"]][["Sensitivity"]],
rf_train_confm_smote[["byClass"]][["Sensitivity"]]),
stringsAsFactors = F)
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.9008535 0.5 0.8571429 0.7734575
## 2 Logistic regression (simple) 0.9007183 0.5 0.8572993 0.7734575
## 3 Decision tree (complex) 0.9201630 0.5 0.8928963 0.8332421
## 4 Decision tree (simple) 0.9077987 0.5 0.8862854 0.8384447
## 5 Random forest (ranger) 0.9721808 0.5 0.9146847 0.8737678
score_test_smote = data.frame("logistic (complex)" = logistic_test_score_smote,
"logistic (simple)" = logistic_test_score_smote_2,
"tree complex" = tree_opt_test_score_smote,
"tree simple" = tree_opt_test_score_smote_2,
"random forest" = rf_test_score_smote,
"obs" = as.numeric(bank_test_smote$y) - 1) %>%
gather(key = "Method", value = "score", -obs)
roc_test_smote = score_test_smote %>%
ggplot() +
aes(d = obs,
m = score,
color = Method) +
geom_roc() +
ggtitle("Validation dataset")
roc_test_smote
df_final_smote = data.frame("Model" = c("Logistic regression (complex)",
"Logistic regression (simple)",
"Decision tree (complex)",
"Decision tree (simple)",
"Random forest (ranger)"),
"AUC" = c(auc(bank_test_smote$y, logistic_test_score_smote),
auc(bank_test_smote$y, logistic_test_score_smote_2),
auc(bank_test_smote$y, tree_opt_test_score_smote),
auc(bank_test_smote$y, tree_opt_test_score_smote_2),
auc(bank_test_smote$y, rf_test_score_smote)),
"Threshold" = c(0.5, 0.5, 0.5, 0.5, 0.5),
"Accuracy" = c(logistic_test_confm_smote[["overall"]][["Accuracy"]],
logistic_test_confm_smote_2[["overall"]][["Accuracy"]],
tree_opt_test_confm_smote[["overall"]][["Accuracy"]],
tree_opt_test_confm_smote_2[["overall"]][["Accuracy"]],
rf_test_confm_smote[["overall"]][["Accuracy"]]),
"Sensitivity" = c(logistic_test_confm_smote[["byClass"]][["Sensitivity"]],
logistic_test_confm_smote_2[["byClass"]][["Sensitivity"]],
tree_opt_test_confm_smote[["byClass"]][["Sensitivity"]],
tree_opt_test_confm_smote_2[["byClass"]][["Sensitivity"]],
rf_test_confm_smote[["byClass"]][["Sensitivity"]]),
stringsAsFactors = F)
df_final_smote
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7701089 0.5 0.8670593 0.4901316
## 2 Logistic regression (simple) 0.7697226 0.5 0.8676858 0.4934211
## 3 Decision tree (complex) 0.7607222 0.5 0.8765819 0.4616228
## 4 Decision tree (simple) 0.7628825 0.5 0.8729483 0.4978070
## 5 Random forest (ranger) 0.7851047 0.5 0.8780855 0.4703947
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7749897 0.200 0.8634256 0.5394737
## 2 Logistic regression (simple) 0.7756386 0.200 0.8621727 0.5427632
## 3 Decision tree (complex) 0.6926397 0.250 0.8784613 0.4298246
## 4 Decision tree (simple) 0.7396911 0.300 0.8710688 0.5098684
## 5 Random forest (ranger) 0.7861381 0.125 0.8660569 0.5592105
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7776273 0.5 0.8130560 0.5997807
## 2 Logistic regression (simple) 0.7786925 0.5 0.8104248 0.6063596
## 3 Decision tree (complex) 0.7616706 0.5 0.8178173 0.6162281
## 4 Decision tree (simple) 0.7515769 0.5 0.8085453 0.6315789
## 5 Random forest (ranger) 0.7881108 0.5 0.8209498 0.6304825
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7767975 0.5 0.8215762 0.5932018
## 2 Logistic regression (simple) 0.7767675 0.5 0.8227039 0.5899123
## 3 Decision tree (complex) 0.7566659 0.5 0.8260870 0.6008772
## 4 Decision tree (simple) 0.7519985 0.5 0.8336048 0.5964912
## 5 Random forest (ranger) 0.7809281 0.5 0.8676858 0.5350877
## Model AUC Threshold Accuracy Sensitivity
## 1 Logistic regression (complex) 0.7701089 0.5 0.8670593 0.4901316
## 2 Logistic regression (simple) 0.7697226 0.5 0.8676858 0.4934211
## 3 Decision tree (complex) 0.7607222 0.5 0.8765819 0.4616228
## 4 Decision tree (simple) 0.7628825 0.5 0.8729483 0.4978070
## 5 Random forest (ranger) 0.7851047 0.5 0.8780855 0.4703947
df_final_tot = bind_rows(list("ns" = df_final_ns,
"ds" = df_final_ds,
"us" = df_final_us,
"smote" = df_final_smote),
.id = "sampling")
df_final_tot
## sampling Model AUC Threshold Accuracy
## 1 ns Logistic regression (complex) 0.7749897 0.200 0.8634256
## 2 ns Logistic regression (simple) 0.7756386 0.200 0.8621727
## 3 ns Decision tree (complex) 0.6926397 0.250 0.8784613
## 4 ns Decision tree (simple) 0.7396911 0.300 0.8710688
## 5 ns Random forest (ranger) 0.7861381 0.125 0.8660569
## 6 ds Logistic regression (complex) 0.7776273 0.500 0.8130560
## 7 ds Logistic regression (simple) 0.7786925 0.500 0.8104248
## 8 ds Decision tree (complex) 0.7616706 0.500 0.8178173
## 9 ds Decision tree (simple) 0.7515769 0.500 0.8085453
## 10 ds Random forest (ranger) 0.7881108 0.500 0.8209498
## 11 us Logistic regression (complex) 0.7767975 0.500 0.8215762
## 12 us Logistic regression (simple) 0.7767675 0.500 0.8227039
## 13 us Decision tree (complex) 0.7566659 0.500 0.8260870
## 14 us Decision tree (simple) 0.7519985 0.500 0.8336048
## 15 us Random forest (ranger) 0.7809281 0.500 0.8676858
## 16 smote Logistic regression (complex) 0.7701089 0.500 0.8670593
## 17 smote Logistic regression (simple) 0.7697226 0.500 0.8676858
## 18 smote Decision tree (complex) 0.7607222 0.500 0.8765819
## 19 smote Decision tree (simple) 0.7628825 0.500 0.8729483
## 20 smote Random forest (ranger) 0.7851047 0.500 0.8780855
## Sensitivity
## 1 0.5394737
## 2 0.5427632
## 3 0.4298246
## 4 0.5098684
## 5 0.5592105
## 6 0.5997807
## 7 0.6063596
## 8 0.6162281
## 9 0.6315789
## 10 0.6304825
## 11 0.5932018
## 12 0.5899123
## 13 0.6008772
## 14 0.5964912
## 15 0.5350877
## 16 0.4901316
## 17 0.4934211
## 18 0.4616228
## 19 0.4978070
## 20 0.4703947
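The two rankings that follow are consistent with simple `dplyr::arrange()` calls on `df_final_tot` (a sketch; the original code for these tables is not shown in this excerpt):

```r
# Sketch: rank the 20 sampling/model combinations two ways.
df_final_tot %>% arrange(desc(Accuracy))     # by validation accuracy
df_final_tot %>% arrange(desc(Sensitivity))  # by validation sensitivity
```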
## sampling Model AUC Threshold Accuracy
## 1 ns Decision tree (complex) 0.6926397 0.250 0.8784613
## 2 smote Random forest (ranger) 0.7851047 0.500 0.8780855
## 3 smote Decision tree (complex) 0.7607222 0.500 0.8765819
## 4 smote Decision tree (simple) 0.7628825 0.500 0.8729483
## 5 ns Decision tree (simple) 0.7396911 0.300 0.8710688
## 6 us Random forest (ranger) 0.7809281 0.500 0.8676858
## 7 smote Logistic regression (simple) 0.7697226 0.500 0.8676858
## 8 smote Logistic regression (complex) 0.7701089 0.500 0.8670593
## 9 ns Random forest (ranger) 0.7861381 0.125 0.8660569
## 10 ns Logistic regression (complex) 0.7749897 0.200 0.8634256
## 11 ns Logistic regression (simple) 0.7756386 0.200 0.8621727
## 12 us Decision tree (simple) 0.7519985 0.500 0.8336048
## 13 us Decision tree (complex) 0.7566659 0.500 0.8260870
## 14 us Logistic regression (simple) 0.7767675 0.500 0.8227039
## 15 us Logistic regression (complex) 0.7767975 0.500 0.8215762
## 16 ds Random forest (ranger) 0.7881108 0.500 0.8209498
## 17 ds Decision tree (complex) 0.7616706 0.500 0.8178173
## 18 ds Logistic regression (complex) 0.7776273 0.500 0.8130560
## 19 ds Logistic regression (simple) 0.7786925 0.500 0.8104248
## 20 ds Decision tree (simple) 0.7515769 0.500 0.8085453
## Sensitivity
## 1 0.4298246
## 2 0.4703947
## 3 0.4616228
## 4 0.4978070
## 5 0.5098684
## 6 0.5350877
## 7 0.4934211
## 8 0.4901316
## 9 0.5592105
## 10 0.5394737
## 11 0.5427632
## 12 0.5964912
## 13 0.6008772
## 14 0.5899123
## 15 0.5932018
## 16 0.6304825
## 17 0.6162281
## 18 0.5997807
## 19 0.6063596
## 20 0.6315789
## sampling Model AUC Threshold Accuracy
## 1 ds Decision tree (simple) 0.7515769 0.500 0.8085453
## 2 ds Random forest (ranger) 0.7881108 0.500 0.8209498
## 3 ds Decision tree (complex) 0.7616706 0.500 0.8178173
## 4 ds Logistic regression (simple) 0.7786925 0.500 0.8104248
## 5 us Decision tree (complex) 0.7566659 0.500 0.8260870
## 6 ds Logistic regression (complex) 0.7776273 0.500 0.8130560
## 7 us Decision tree (simple) 0.7519985 0.500 0.8336048
## 8 us Logistic regression (complex) 0.7767975 0.500 0.8215762
## 9 us Logistic regression (simple) 0.7767675 0.500 0.8227039
## 10 ns Random forest (ranger) 0.7861381 0.125 0.8660569
## 11 ns Logistic regression (simple) 0.7756386 0.200 0.8621727
## 12 ns Logistic regression (complex) 0.7749897 0.200 0.8634256
## 13 us Random forest (ranger) 0.7809281 0.500 0.8676858
## 14 ns Decision tree (simple) 0.7396911 0.300 0.8710688
## 15 smote Decision tree (simple) 0.7628825 0.500 0.8729483
## 16 smote Logistic regression (simple) 0.7697226 0.500 0.8676858
## 17 smote Logistic regression (complex) 0.7701089 0.500 0.8670593
## 18 smote Random forest (ranger) 0.7851047 0.500 0.8780855
## 19 smote Decision tree (complex) 0.7607222 0.500 0.8765819
## 20 ns Decision tree (complex) 0.6926397 0.250 0.8784613
## Sensitivity
## 1 0.6315789
## 2 0.6304825
## 3 0.6162281
## 4 0.6063596
## 5 0.6008772
## 6 0.5997807
## 7 0.5964912
## 8 0.5932018
## 9 0.5899123
## 10 0.5592105
## 11 0.5427632
## 12 0.5394737
## 13 0.5350877
## 14 0.5098684
## 15 0.4978070
## 16 0.4934211
## 17 0.4901316
## 18 0.4703947
## 19 0.4616228
## 20 0.4298246
## sampling Model AUC Threshold Accuracy
## 1 ds Random forest (ranger) 0.7881108 0.500 0.8209498
## 2 ds Decision tree (simple) 0.7515769 0.500 0.8085453
## 3 ds Decision tree (complex) 0.7616706 0.500 0.8178173
## 4 us Decision tree (complex) 0.7566659 0.500 0.8260870
## 5 us Decision tree (simple) 0.7519985 0.500 0.8336048
## 6 ds Logistic regression (simple) 0.7786925 0.500 0.8104248
## 7 ds Logistic regression (complex) 0.7776273 0.500 0.8130560
## 8 us Logistic regression (complex) 0.7767975 0.500 0.8215762
## 9 us Logistic regression (simple) 0.7767675 0.500 0.8227039
## 10 ns Random forest (ranger) 0.7861381 0.125 0.8660569
## 11 ns Logistic regression (simple) 0.7756386 0.200 0.8621727
## 12 ns Logistic regression (complex) 0.7749897 0.200 0.8634256
## 13 us Random forest (ranger) 0.7809281 0.500 0.8676858
## 14 ns Decision tree (simple) 0.7396911 0.300 0.8710688
## 15 smote Decision tree (simple) 0.7628825 0.500 0.8729483
## 16 smote Logistic regression (simple) 0.7697226 0.500 0.8676858
## 17 smote Logistic regression (complex) 0.7701089 0.500 0.8670593
## 18 smote Random forest (ranger) 0.7851047 0.500 0.8780855
## 19 smote Decision tree (complex) 0.7607222 0.500 0.8765819
## 20 ns Decision tree (complex) 0.6926397 0.250 0.8784613
## Sensitivity Perf
## 1 0.6304825 2.081915
## 2 0.6315789 2.071703
## 3 0.6162281 2.050273
## 4 0.6008772 2.027841
## 5 0.5964912 2.026587
## 6 0.6063596 2.023144
## 7 0.5997807 2.012617
## 8 0.5932018 2.007980
## 9 0.5899123 2.002528
## 10 0.5592105 1.984478
## 11 0.5427632 1.947699
## 12 0.5394737 1.942373
## 13 0.5350877 1.937861
## 14 0.5098684 1.890806
## 15 0.4978070 1.868562
## 16 0.4934211 1.854528
## 17 0.4901316 1.847322
## 18 0.4703947 1.818875
## 19 0.4616228 1.799827
## 20 0.4298246 1.738110
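The `Perf` column in the last table is not defined in this excerpt, but its printed values agree, row by row, with accuracy plus twice the sensitivity (e.g. 0.8209498 + 2 × 0.6304825 ≈ 2.081915). A sketch of that assumed reconstruction:

```r
# Sketch (assumed formula): a combined score that weights sensitivity
# twice as heavily as accuracy, matching the printed Perf values.
df_final_tot %>%
  mutate(Perf = Accuracy + 2 * Sensitivity) %>%
  arrange(desc(Perf))
```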
1.1.1.4 Social and economic context attributes: